DETAILED ACTION



The preliminary amendment filed 09/10/2002 has been entered. Claims 1-13 are 
pending and being examined. 



Specification

The disclosure is objected to because it contains an embedded hyperlink and/or
other form of browser-executable code. Applicant is required to delete the embedded
hyperlink and/or other form of browser-executable code. See MPEP § 608.01.

Claim Rejections - 35 USC §§ 101, 112

35 U.S.C. 101 reads as follows:

Whoever invents or discovers any new and useful process, machine, manufacture, or composition of
matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the
conditions and requirements of this title.

The following is a quotation of the first paragraph of 35 U.S.C. 112:

The specification shall contain a written description of the invention, and of the manner and process of
making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the
art to which it pertains, or with which it is most nearly connected, to make and use the same and shall
set forth the best mode contemplated by the inventor of carrying out his invention.

The following is a quotation of the second paragraph of 35 U.S.C. 112:

The specification shall conclude with one or more claims particularly pointing out and distinctly
claiming the subject matter which the applicant regards as his invention.

Claims 1-13 are rejected under 35 U.S.C. 101 because the claimed invention is
not supported by either a specific and substantial asserted utility or a well established
utility.

The present claims are drawn to or encompass an isolated polypeptide comprising
the amino acid sequence of SEQ ID NO: 110 (PRO1753) or comprising an amino acid
sequence having a recited % identity thereto.

The present specification discloses a nucleotide sequence (SEQ ID NO: 109) of a
native sequence PRO1753 cDNA, wherein SEQ ID NO: 109 is a clone designated as
"DNA68883-1691" (paragraph 0135). FIG. 110 shows the amino acid sequence (SEQ ID
NO: 110) derived from the coding sequence of SEQ ID NO: 109 shown in FIG. 109
(paragraph 0136). The specification discloses uses for PRO polynucleotides and
polypeptides in general (paragraphs 0316-0360; pages 86-100). Example 18 (Tumor
Versus Normal Differential Tissue Expression Distribution) discloses that DNA68883-1691
is more highly expressed in esophageal tumor as compared to normal esophagus
(page 143).

The present specification discloses that secreted proteins and membrane-bound
proteins and receptors have widely varying activities (paragraphs 0002-0004). This
finding establishes that secreted proteins and membrane-bound proteins and receptors
have very diverse functions and makes it clear that classification of a protein as a secreted
protein or a membrane-bound protein or receptor does not identify it as having a specific
function. The specification provides no basis for concluding which, if any, of the varied
activities of secreted proteins and membrane-bound proteins and receptors is possessed
by the PRO1753 polypeptide. There is no evidence that a skilled artisan would have
appreciated that the identification of the PRO1753 polypeptide, without more, would have
suggested any specific patentable utility.

The disclosed uses for PRO polynucleotides and polypeptides in general
(paragraphs 0316-0360) are not specific to the PRO1753 polypeptide.

Although the specification discloses that DNA68883-1691 is more highly
expressed in esophageal tumor as compared to normal esophagus (page 143), the
specification provides no information regarding the absolute values of the differences in
transcript levels and provides no information regarding level of expression, activity, or
role of the PRO1753 polypeptide in cancer. The art demonstrates that increased
transcript levels do not necessarily correlate with increased polypeptide levels. See
Haynes (U), who studied more than 80 proteins relatively homogeneous in half-life and
expression level, and found no strong correlation between protein and transcript level.
For some genes, equivalent mRNA levels translated into protein abundances which
varied more than 50-fold. Haynes concluded that the protein levels cannot be accurately
predicted from the level of the corresponding mRNA transcript (page 1863, second
paragraph, and Figure 1).
Hancock (V) states that "the markers that are generated by proteomics are not
always consistent with the markers that are generated from expression profiling" (full
paragraph 2).

Therefore, the art indicates that transcript levels are not always correlated with
protein levels.

Furthermore, the literature cautions researchers against drawing conclusions based
on small changes in transcript expression levels between normal and cancerous tissue.
For example, Hu (W) analyzed 2286 genes that showed a greater than 1-fold difference in
mean expression level between breast cancer samples and normal samples in a
microarray (p. 408, middle of right column). Hu discovered that, for genes displaying a
5-fold change or less in tumors compared to normal, there was no evidence of a
correlation between altered gene expression and a known role in the disease. However,
among genes with a 10-fold or more change in expression level, there was a strong and
significant correlation between expression level and a published role in the disease (see
discussion section).
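
The fold-change ranges attributed to Hu can be illustrated with a minimal sketch. The function names, the sample intensities, and the treatment of decreases as reciprocal fold changes are assumptions made for illustration only; none of them is taken from Hu or from the specification. The sketch shows only how a tumor-versus-normal ratio would be computed and sorted against the 5-fold and 10-fold cut-offs mentioned above.

```python
def fold_change(tumor_level: float, normal_level: float) -> float:
    """Return the magnitude of change between two positive expression levels.

    The result is always >= 1, so a 4-fold decrease and a 4-fold increase
    both report 4.0 (an assumption made for this illustration).
    """
    if tumor_level <= 0 or normal_level <= 0:
        raise ValueError("expression levels must be positive")
    ratio = tumor_level / normal_level
    return ratio if ratio >= 1 else 1 / ratio


def hu_bin(fc: float) -> str:
    """Sort a fold change into the ranges discussed in the text."""
    if fc >= 10:
        return "10-fold or more"
    if fc <= 5:
        return "5-fold or less"
    return "between 5-fold and 10-fold"


# Hypothetical (tumor, normal) intensities; not data from any cited reference.
examples = {
    "gene_a": (240.0, 100.0),
    "gene_b": (1500.0, 100.0),
    "gene_c": (20.0, 140.0),
}
for name, (tumor, normal) in examples.items():
    fc = fold_change(tumor, normal)
    print(f"{name}: {fc:.1f}-fold -> {hu_bin(fc)}")
```

Because the specification reports no absolute transcript values for DNA68883-1691, no such ratio can be computed for it from the record.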

In addition, Wang (X) indicates that differential display is the first of many steps 
required in the discovery of a novel pharmacological target, especially given that the 
function of the factor is most likely unknown. Therefore, further action should be taken 
to characterize the functions of a particular gene of interest, including ... validation for the 
importance of the gene in disease processes. See page 279, column 2, full paragraph 1. 

Finally, one skilled in the art recognizes that although structural similarity can
serve to classify a protein as related to other known proteins, this classification is
insufficient to establish a function or biological significance for the protein because
ancient duplications and rearrangements of protein-coding segments have resulted in
complex gene family relationships. Duplications can be tandem or dispersed and can
involve entire coding regions or modules that correspond to folded protein domains. As a
result, gene products may acquire new specificities, altered recognition properties, or
modified functions. Extreme proliferation of some families within an organism, perhaps
at the expense of other families, may correspond to functional innovations during
evolution. See Henikoff (Y), page 609, Abstract. Accordingly, one skilled in the art
would not accept mere homology as establishing a function of a protein because gene
products may acquire new specificities, altered recognition properties, or modified
functions. Rather, homology complements experimental data accumulated for the
homologous protein in understanding the homologous protein's biological role.
Although the presence of a protein module in a protein of interest adds potential insight
into its function and guides experiments, insight into the biological function of a protein
cannot be automated. However, homology can be used to guide further research. See
Henikoff (Y), paragraph bridging pages 613-614, through page 614, paragraph bridging
columns 1-2.

Haynes, Hancock, Hu, Wang, and Henikoff are evidence that the specification
fails to disclose enough information about the invention to make its usefulness
immediately apparent to those familiar with the technological field of the invention. This
countervailing evidence shows that the skilled artisan would have a legitimate basis to
doubt the utility of the PRO1753 polypeptide. The skilled artisan would not know if
PRO1753 polypeptide expression could, should, or would be upregulated, down-regulated,
or unchanged in cancer. Therefore, the disclosure that DNA68883-1691 is
more highly expressed in esophageal tumor as compared to normal esophagus does not
impute a specific, substantial, and credible utility to the PRO1753 polypeptide. Based on
the present disclosure, one skilled in the art would be required to carry out further
research to identify or reasonably confirm a "real world" context of use. Utilities that
require or constitute carrying out further research to identify or reasonably confirm a
"real world" context of use are not substantial utilities. Therefore, the increased transcript
levels of DNA68883-1691 in esophageal tumor as compared to normal esophagus do
not establish a substantial or real-world use for the claimed polypeptide. Thus, the
present disclosure is simply a starting point for further research and investigation into
potential practical uses of the claimed polypeptides. See Brenner v. Manson, 148
U.S.P.Q. 689 (Sup. Ct. 1966), wherein the court held that:

"The basic quid pro quo contemplated by the Constitution and the
Congress for granting a patent monopoly is the benefit derived by
the public from an invention with substantial utility", "[u]nless and
until a process is refined and developed to this point - where specific
benefit exists in currently available form - there is insufficient
justification for permitting an applicant to engross what may prove
to be a broad field", and "a patent is not a hunting license", "[i]t is
not a reward for the search, but compensation for its successful
conclusion."

Claims 1-13 are also rejected under 35 U.S.C. 112, first paragraph. Specifically,
since the claimed invention is not supported by either a specific and substantial asserted
utility or a well established utility for the reasons set forth above, one skilled in the art
clearly would not know how to use the claimed invention.

Claims 1-5, 12, 13 are rejected under 35 U.S.C. 112, first paragraph, as failing to
comply with the enablement requirement. The claim(s) contains subject matter which
was not described in the specification in such a way as to enable one skilled in the art to
which it pertains, or with which it is most nearly connected, to make and/or use the
invention.

The claims are directed to or encompass a polypeptide having at least 80% amino
acid sequence identity to the polypeptide of SEQ ID NO: 110, to said polypeptide lacking
its associated signal peptide, or to the extracellular domain thereof. The claims are broad
because they do not require the claimed polypeptide to be identical to the disclosed
PRO1753 polypeptide and because the claims have no functional limitation.

The first paragraph of 35 U.S.C. 112 requires that the scope of the claims
bear a reasonable correlation to the scope of enablement provided by the specification to
persons of ordinary skill in the art. In cases involving predictable factors, such as
mechanical or electrical elements, a single embodiment provides broad enablement in the
sense that, once imagined, other embodiments can be made without difficulty and their
performance characteristics predicted by resort to known scientific laws. In cases
involving unpredictable factors, such as most chemical reactions and physiological
activity, the scope of enablement varies inversely with the degree of unpredictability of
the factors involved.

The PRO1753 polypeptide appears to be a secreted polypeptide. However, the
present specification discloses that secreted proteins and membrane-bound proteins and
receptors have widely varying activities (paragraphs 0002-0004). This finding
establishes that secreted proteins and membrane-bound proteins and receptors have very
diverse functions and makes it clear that classification of a protein as a secreted protein or
a membrane-bound protein or receptor does not identify it as having a specific function.
The specification provides no basis for concluding which, if any, of the varied activities
of secreted proteins and membrane-bound proteins and receptors is possessed by the
PRO1753 polypeptide. There is no evidence that a skilled artisan would have appreciated
that the identification of the PRO1753 polypeptide, without more, would have suggested any
specific use. Therefore, the knowledge that a protein is a secreted polypeptide does not
provide predictability about its function.

There are no working examples of polypeptides with an amino acid sequence less
than 100% identical to the amino acid sequence of SEQ ID NO: 110. The examiner is
aware that working examples are not required. Lack of a working example, however, is a
factor to be considered, especially in cases involving an unpredictable and undeveloped
art.

The specification does not provide guidance for using polypeptides related to (i.e.,
80%-99% identity) but not identical to SEQ ID NO: 110. Specifically, the instant
specification does not identify those amino acid residues in the amino acid sequence of
PRO1753 which are essential for its biological activity and structural integrity and those
residues which are either expendable or substitutable. In the absence of this information,
the skilled artisan is left to an unduly extensive amount of random, trial and error
experimentation wherein a polypeptide comprising the amino acid sequence of SEQ ID
NO: 110 is randomly mutated and randomly assayed for a useful activity. Further, there
does not appear to be a functionally and structurally analogous protein which has been
identified in the prior art for which this information is known and could be extrapolated
to the PRO1753 polypeptide by analogy. In any case, while a specification need not
disclose what is well known in the art, that rule does not excuse an applicant from
providing a complete disclosure. It is the specification, not the knowledge of one skilled
in the art, that must supply the novel aspects of an invention in order to constitute
adequate enablement. Based on the teachings of the present specification, the skilled
artisan would not know how to use such non-identical polypeptides absent undue
experimentation. To practice the instant invention in a manner consistent with the
breadth of the claims would not require just a repetition of work that is described in the
instant application but a substantial inventive contribution on the part of a practitioner
which would involve the determination of those amino acid residues in the amino acid
sequence of SEQ ID NO: 110 which are required for the functional and structural
integrity of the PRO1753 polypeptide. It is this additional characterization of that single
disclosed, naturally occurring protein that constitutes undue experimentation.

For these reasons, which include the complexity and unpredictability of the nature
of the invention and art in terms of the diversity of secreted proteins and membrane-bound
proteins and receptors and lack of knowledge about function(s) associated with the
PRO1753 polypeptide and its variants, the lack of working examples, the lack of
direction or guidance for using polypeptides that are not identical to SEQ ID NO: 110,
and the breadth of the claims for structure without function, it would require undue
experimentation to use the invention commensurate in scope with the claims.

Claims 1-5, 12, 13 are rejected under 35 U.S.C. 112, first paragraph, as failing to
comply with the written description requirement. The claim(s) contains subject matter
which was not described in the specification in such a way as to reasonably convey to one
skilled in the relevant art that the inventor(s), at the time the application was filed, had
possession of the claimed invention.

The claims are drawn to or encompass a polypeptide having at least 80%, 85%,
90%, 95% or 99% sequence identity with SEQ ID NO: 110, with said sequence
lacking its associated signal peptide, or with the extracellular domain of said sequence.
The claims do not require that the polypeptide possess any particular biological activity,
nor any particular conserved structure, or other disclosed distinguishing feature. Thus,
the claims are drawn to a genus of polypeptides that is defined only by sequence identity.

To provide adequate written description and evidence of possession of a claimed
genus, the specification must provide sufficient distinguishing identifying characteristics
of the genus. The factors to be considered include disclosure of complete or partial
structure, physical and/or chemical properties, functional characteristics,
structure/function correlation, methods of making the claimed product, or any
combination thereof. In this case, the only factor present in the claim is a partial structure
in the form of a recitation of percent identity. There is not even identification of any
particular portion of the structure that must be conserved. Accordingly, in the absence of
sufficient recitation of distinguishing identifying characteristics, the specification does
not provide adequate written description of the claimed genus.
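
To make concrete what a recitation of percent identity does and does not fix, the following minimal sketch computes percent identity over two pre-aligned, equal-length sequences and the number of positions that may differ at a given identity threshold. The function names, the toy sequences, and the 300-residue length are hypothetical; the length of SEQ ID NO: 110 is not stated in this action, and real identity calculations depend on the alignment program and parameters used.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent of positions carrying the same residue in two equal-length,
    already-aligned sequences (no gap handling; a simplifying assumption)."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


def max_differing_positions(length: int, min_identity_pct: int) -> int:
    """Largest number of substituted positions that still satisfies
    'at least min_identity_pct percent identity' for this length."""
    # Integer ceiling of length * pct / 100, avoiding floating-point rounding.
    min_matches = (length * min_identity_pct + 99) // 100
    return length - min_matches


# Toy sequences differing at one of eight positions (hypothetical).
print(percent_identity("MKTLLVAG", "MKTLIVAG"))

# Hypothetical 300-residue polypeptide; the thresholds are those recited in the claims.
for pct in (80, 85, 90, 95, 99):
    print(pct, max_differing_positions(300, pct))
```

The calculation counts how many positions may change but says nothing about which ones, which is the point made above: no particular portion of the structure is identified as one that must be conserved.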

Vas-Cath Inc. v. Mahurkar, 19 USPQ2d 1111, clearly states "applicant must
convey with reasonable clarity to those skilled in the art that, as of the filing date sought,
he or she was in possession of the invention. The invention is, for purposes of the
'written description' inquiry, whatever is now claimed." (See page 1117.) The
specification does not "clearly allow persons of ordinary skill in the art to recognize that
[he or she] invented what is claimed" (See Vas-Cath at page 1116). As discussed above,
the skilled artisan cannot envision the detailed chemical structure of the encompassed
genus of polypeptides, and therefore conception is not achieved until reduction to
practice has occurred, regardless of the complexity or simplicity of the method of
isolation. Adequate written description requires more than a mere statement that it is part
of the invention and reference to a potential method of isolating it. The compound itself
is required. See Fiers v. Revel, 25 USPQ2d 1601 at 1606 (CAFC 1993) and Amgen Inc.
v. Chugai Pharmaceutical Co. Ltd., 18 USPQ2d 1016.

One cannot describe what one has not conceived. See Fiddes v. Baird, 30
USPQ2d 1481 at 1483. In Fiddes, claims directed to mammalian FGFs were found to be
unpatentable due to lack of written description for that broad class. The specification
provided only the bovine sequence.

Therefore, only isolated polypeptides comprising the amino acid sequence set
forth in SEQ ID NO: 110, but not the full breadth of the claim, meet the written
description provision of 35 U.S.C. §112, first paragraph. Applicant is reminded that
Vas-Cath makes clear that the written description provision of 35 U.S.C. §112 is severable
from its enablement provision (see page 1115).

Claims 1-6, 9, 10, 12, 13 are rejected under 35 U.S.C. 112, second paragraph, as
being indefinite for failing to particularly point out and distinctly claim the subject matter
which applicant regards as the invention.

The PRO1753 polypeptide is disclosed as a soluble or secreted protein, and is not
disclosed as being expressed on a cell surface. Accordingly, the limitation "extracellular
domain" is indefinite, as the art does not recognize soluble or secreted proteins as having
such domains. Further, if the protein had an extracellular domain, the recitation of "the
extracellular domain . . . lacking its associated signal sequence" is indefinite as a signal
sequence is not generally considered to be part of an extracellular domain, as signal
sequences are cleaved from said domains in the process of secretion from the cell. The
metes and bounds are not clearly set forth.

Conclusion

No claims are allowable.

Any inquiry concerning this communication or earlier communications from the examiner should be
directed to David S. Romeo whose telephone number is (571) 272-0890. The examiner can normally be reached on
Monday through Friday from 7:30 a.m. to 4:00 p.m. If attempts to reach the examiner by telephone are
unsuccessful, the examiner's supervisor, Brenda Brumback, can be reached on (571) 272-0961.

If submitting official correspondence by fax, applicants are encouraged to submit official
correspondence to the following TC 1600 Before and After Final RightFax numbers:
Before Final (703) 872-9306
After Final (703) 872-9307

Customers are also advised to use Certificate of Facsimile procedures when submitting a reply to a
non-final or final Office action by facsimile (see 37 CFR 1.6 and 1.8).

Faxed draft or informal communications should be directed to the examiner at (571) 273-0890.
Any inquiry of a general nature or relating to the status of this application or proceeding should be
directed to the Group receptionist whose telephone number is (703) 308-0196.




David Romeo 
Primary Examiner 
Art Unit 1647

Proteome analysis: Biological assay or data archive?

Ruedi Aebersold
... Washington, Seattle, WA, USA

In this review we examine the current state of proteome analysis. There are
three main issues discussed: why it is necessary to study proteomes; how pro...
...sis is an essential tool in understanding of regulated biological systems.
Current technology, while still mostly limited to the more abundant proteins,
enables the use of proteome analysis both to establish databases of proteins
present, and to perform biological assays involving measurement of multiple
variables. We believe that the utility of proteome analysis in future bio...

Contents
2 Rationale for proteome analysis
2.1 Correlation between mRNA and protein expression levels
2.2 Proteins are dynamically modified and processed
2.3 Proteomes are dynamic and reflect the state of a biological system
3.1 Technical requirements of proteome technology
3.2 2-D electrophoresis - mass spectrometry: a common implementation of proteome analysis
3.4 Assessment of 2-DE-MS proteome technology
4 Utility of proteome analysis for biological ...
4.1 The proteome as a database
4.2 The proteome as a biological assay
5 Concluding remarks

... resolution two-dimensional gel electrophoresis (2-DE) ... The ease, sensitivity
and speed with which gel-separated proteins can be identified by the use of recently
developed mass spectrometry techniques have dramatically increased the interest in
proteome technology. One of the most attractive features of such analyses is that
complex biological systems can potentially be studied in their entirety rather than as a
multitude of individual compo- ... gene products in cells. Large-scale proteome
characteriza- ... -jects currently in progress include, for example: Saccharo- ...
-isms include those for: human bladder squamous cell ... human keratinocytes [12],
human fibroblasts [12], mouse ... -tically assess the concept of proteome analysis and the
technical feasibility of establishing complete proteome ... and discuss ways in which
proteome analysis and ...

... protein analysis technology. Given the long-standing
paradigm in biology that DNA synthesizes RNA which
synthesizes protein, and the ability to rapidly establish
global, quantitative mRNA expression maps, the questions
which arise are why technically complex proteome
projects should be undertaken and what specific types of
information could be expected from proteome projects
which cannot be obtained from genomic and transcript
profiling projects. We see three main reasons for proteome
analysis to become an essential component in the
comprehensive analysis of biological systems. (i) Protein
expression levels are not predictable from the mRNA
expression levels, (ii) proteins are dynamically modified
and processed in ways which are not necessarily
apparent from the gene sequence, and (iii) proteomes
are dynamic and reflect the state of a biological system.

2.1 Correlation between mRNA and protein expression
levels

Interpretations of quantitative mRNA expression profiles
frequently implicitly or explicitly assume that for specific
genes the transcript levels are indicative of the levels of
protein expression. As part of an ongoing study in our
laboratory, we have determined the correlation of expression
at the mRNA and protein levels for a population of
selected genes in the yeast Saccharomyces cerevisiae
growing at mid-log phase (S. P. Gygi et al., submitted for
publication). mRNA expression levels were calculated
from published SAGE frequency tables [22]. Protein
expression levels were quantified by metabolic radiolabeling
of the yeast proteins, liquid scintillation counting
of the protein spots separated by high resolution 2-DE
and mass spectrometry identification of the protein(s)
migrating to each spot. The selected 80 samples constitute
a relatively homogeneous group with respect to predicted
half-life and expression level of the protein products.
Thus far, we have found a general trend but no
strong correlation between protein and transcript levels
(Fig. 1). For some genes studied equivalent mRNA transcript
levels translated into protein abundances which
varied by more than 50-fold. Similarly, equivalent steady-state
protein expression levels were maintained by transcript
levels varying by as much as 40-fold (S. P. Gygi
et al., submitted). These results suggest that even for a
population of genes predicted to be relatively homogeneous
with respect to protein half-life and gene expression,
the protein levels cannot be accurately predicted
from the level of the corresponding mRNA transcript.
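
The kind of comparison described above can be sketched numerically. The values below are invented for illustration and are not the data of Fig. 1; the helper names are likewise hypothetical. The sketch computes a Pearson correlation coefficient between paired mRNA and protein levels and, for genes sharing the same mRNA level, the ratio between the highest and lowest protein abundance.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def protein_spread_at_equal_mrna(pairs):
    """For each mRNA level shared by several genes, return the ratio of the
    highest to the lowest protein abundance observed at that level."""
    groups = defaultdict(list)
    for mrna_level, protein_level in pairs:
        groups[mrna_level].append(protein_level)
    return {m: max(p) / min(p) for m, p in groups.items() if len(p) > 1}


# Hypothetical (mRNA, protein) pairs in arbitrary units; not measured data.
pairs = [(1.0, 2.0), (1.0, 60.0), (2.0, 10.0), (5.0, 15.0), (10.0, 90.0)]
mrna = [m for m, _ in pairs]
protein = [p for _, p in pairs]

print("Pearson r:", round(pearson(mrna, protein), 3))
print("Protein spread at equal mRNA:", protein_spread_at_equal_mrna(pairs))
```

A coefficient well below 1 combined with a large spread at equal mRNA level is the pattern the passage describes as a general trend without strong correlation.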

2.2 Proteins are dynamically modified and processed

In the mature, biologically active form many proteins are
post-translationally modified by glycosylation, phosphorylation,
prenylation, acylation, ubiquitination or one or
more of many other modifications [23] and many proteins
are only functional if specifically associated or complexed
with other molecules, including DNA, RNA, proteins
and organic and inorganic cofactors. Frequently,
modifications are dynamic and reversible and may alter
the precise three-dimensional structure and the state of
activity of a protein. Collectively, the state of modification
of the proteins which constitute a biological system
are important indicators for the state of the system. The
type of protein modification and the sites modified at a
specific cellular state can usually not be determined
from the gene sequence alone.

Figure 1. Correlation between mRNA and protein levels in yeast cells.
For a selected population of 80 genes, protein levels were measured
by 35S-radiolabeling and mRNA levels were calculated from published
SAGE tables. Inset: expanded view of the low abundance region.
For more experimental details, also see Figs. 5 and 6 (S. P. Gygi et al.,
submitted).

2.3 Proteomes are dynamic and reflect the state of a
biological system

A single genome can give rise to many qualitatively and
quantitatively different proteomes. Specific stages of the
cell cycle and states of differentiation, responses to
growth and nutrient conditions, temperature and stress,
and pathological conditions represent cellular states
which are characterized by significantly different proteomes.
The proteome, in principle, also reflects events
that are under translational and post-translational control.
It is therefore expected that proteomics will be able
to provide the most precise and detailed molecular description
of the state of a cell or tissue, provided that the
external conditions defining the state are carefully determined.
In answer to the question of whether the study
of proteomes is necessary for the analysis of biomolecular
systems, it is evident that the analysis of mature protein
products in cells is essential as there are numerous
levels of control of protein synthesis, degradation,
processing and modification, which are only apparent by
direct protein analysis.



3 Description and assessment of current proteome 
analysis technology 

3.1 Technical requirements of proteome technology 

In biological systems the level of expression as well as
the states of modification, processing and macro-molecular
association of proteins are controlled and modulated
depending on the state of the system. Comprehensive
analysis of the identity, quantity and state of modification
of proteins therefore requires the detection and
quantitation of the proteins which constitute the system,
and analysis of differentially processed forms. There are
a number of inherent difficulties in protein analysis
which complicate these tasks. First, proteins cannot be
amplified. It is possible to produce large amounts of a
particular protein by over-expression in specific cell systems.
However, since many proteins are dynamically
post-translationally modified, they cannot be easily amplified
in the form in which they finally function in the
biological system. It is frequently difficult to purify from
the native source sufficient amounts of a protein for
analysis. From a technological point of view this translates
into the need for high sensitivity analytical techniques.
Second, many proteins are modified and processed
post-translationally. Therefore, in addition to the
protein identity, the structural basis for differentially
modified isoforms also needs to be determined. The distribution
of a constant amount of protein over several
differentially modified isoforms further reduces the
amount of each species available for analysis. The complexity
and dynamics of post-translational protein editing
thus significantly complicates proteome studies.
Third, proteins vary dramatically with respect to their
solubility in commonly used solvents. There are few, if
any, solvent conditions in which all proteins are soluble
and which are also compatible with protein analysis. This
makes the development of protein purification methods
particularly difficult since both protein purification and
solubility have to be achieved under the same conditions.
Detergents, in particular sodium dodecyl sulfate
(SDS), are frequently added to aqueous solvents to
maintain protein solubility. The compatibility with SDS
is a big advantage of SDS polyacrylamide gel electrophoresis
(SDS-PAGE) over other protein separation
techniques. Thus, SDS-PAGE and two-dimensional gel
electrophoresis, which also uses SDS and other detergents,
are the most general and preferred methods for
the purification of small amounts of proteins, provided
that activity does not necessarily need to be maintained.
Lastly, the number of proteins in a given cell system is
typically in the thousands. Any attempt to identify and
categorize all of these must use methods which are as
rapid as possible to allow completion of the project
within a reasonable time frame. Therefore, a successful,
general proteomics technology requires high sensitivity,
high throughput, the ability to differentiate differentially
modified proteins, and the ability to quantitatively display
and analyze all the proteins present in a sample.

3.2 2-D electrophoresis - mass spectrometry: a common
implementation of proteome analysis

The most common currently used implementation of 
proteome analysis technology is based on the separation 
of proteins by two-dimensional (IEF/SDS-PAGE) gel 
electrophoresis and their subsequent identification and 
analysis by mass spectrometry (MS) or tandem mass 
spectrometry (MS/MS). In 2-DE, proteins are first
separated by isoelectric focusing (IEF) and then by
SDS-PAGE in the second, perpendicular dimension.
Separated proteins are visualized at high sensitivity by staining
or autoradiography, producing two-dimensional arrays of
proteins. 2-DE gels are, at present, the most commonly
used means of global display of proteins in complex
samples. The separation of thousands of proteins has
been achieved in a single gel [24, 25] and differentially
modified proteins are frequently separated. Due to the 
compatibility of 2-DE with high concentrations of deter- 
gents, protein denaturants and other additives promoting 
protein solubility, the technique is widely used. 

The second step of this type of proteome analysis is the
identification and analysis of separated proteins. Individual
proteins from polyacrylamide gels have traditionally
been identified using N-terminal sequencing [26, 27],
internal peptide sequencing [28, 29], immunoblotting or
comigration with known proteins [30]. The recent dramatic
growth of large-scale genomic and expressed
sequence tag (EST) sequence databases has resulted in a
fundamental change in the way proteins are identified by
their amino acid sequence. Rather than by the traditional
methods described above, protein sequences are now
frequently determined by correlating mass spectral or
tandem mass spectral data of peptides derived from
proteins with the information contained in sequence
databases [31-33].

There are a number of alternative approaches to
proteome analysis currently under development. There is
considerable interest in developing a proteome analysis
strategy which bypasses 2-DE altogether, because it is
considered a relatively slow and tedious process, and
because of perceived difficulties in extracting proteins
from the gel matrix for analysis. However, 2-DE as a
starting point for proteome analysis has many
advantages compared to other techniques available today. The
most significant strengths of the 2-DE-MS approach
include the relatively uniform behavior of proteins in
gels, the ability to quantify spots and the high resolution
and simultaneous display of hundreds to thousands of
proteins within a reasonable time frame.

A schematic diagram of a typical procedure for the
identification of gel-separated proteins is shown in Fig. 2.
Protein spots detected in the gel are enzymatically or
chemically fragmented and the peptide fragments are isolated
for analysis, as already indicated, most frequently by MS
or MS/MS. There are numerous protocols for the
generation of peptide fragments from gel-separated proteins.
They can be grouped into two categories: digestion in
the gel slice [28, 34] or digestion after electrotransfer out
of the gel onto a suitable membrane ([29, 35-37] and
reviewed in [38]). In most instances either technique is
applicable and yields good results. The analysis of MS or
MS/MS data is an important step in the whole process
because MS instruments can generate an enormous
amount of information which cannot easily be managed
manually. Recently, a number of groups have developed
software systems dedicated to the use of peptide MS
and MS/MS spectra for the identification of proteins.
Proteins are identified by correlating the information
contained in the MS spectra of protein digests or
MS/MS spectra of individual peptides with data
contained in DNA or protein sequence databases.
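
The correlation of peptide masses from an MS spectrum with a sequence database can be illustrated with a minimal sketch. It is not the software used by the authors: the function names, the toy database and the 0.5 Da tolerance are assumptions made for illustration, and only the tryptic cleavage rule (after K or R, not before P) and the standard monoisotopic residue masses are taken as given.

```python
import re

# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free termini


def tryptic_peptides(sequence):
    """Cleave after K or R unless the next residue is P.

    Splitting on an empty match requires Python 3.7 or later.
    """
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def rank_proteins(observed_masses, database, tolerance=0.5):
    """Rank database entries by the number of observed masses they explain.

    `database` maps a protein name to its amino acid sequence; the
    tolerance (in Da) is an assumed value, not one from the text.
    """
    scores = {}
    for name, sequence in database.items():
        theoretical = [peptide_mass(p) for p in tryptic_peptides(sequence)]
        scores[name] = sum(
            any(abs(obs - theo) <= tolerance for theo in theoretical)
            for obs in observed_masses
        )
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

A real search would also have to handle missed cleavages, modified residues and the statistical significance of a match, none of which is attempted here.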

[Figure 2 labels: separate proteins; MS spectrum, database search; MS/MS spectrum, database search]

Figure 2. Schematic diagram of a procedure for identification of gel-
separated proteins. Peptides can either be separated by a technique
such as LC or CE, or infused as a mixture and sorted in the MS.
Database searching can either be performed on peptide masses from an
MS spectrum, peptide fragment masses from CID spectra of peptides,
or a combination of both.

The systems we are currently using in our laboratory are
based on the separation of the peptides contained in
protein digests by narrow bore or capillary liquid
chromatography [39, 40] or capillary electrophoresis [41], the
analysis of the separated peptides by electrospray
ionization (ESI) MS/MS, and the correlation of the generated
peptide spectra with sequence databases using the
SEQUEST program developed at the University of
Washington [32, 33]. The system automatically performs the
following operations: a particular peptide ion
characterized by its mass-to-charge ratio is selected in the MS out
of all the peptide ions present in the system at a
particular time; the selected peptide ion is collided in a
collision cell with argon (collision-induced dissociation,
CID) and the masses of the resulting fragment ions are
determined in the second sector of the tandem MS; this
experimentally determined CID spectrum is then
correlated with the CID spectra predicted from all the
peptides in a sequence database which have essentially the
same mass as the peptide selected for CID; this
correlation matches the isolated peptide with a sequence
segment in a database and thus identifies the protein from
which the peptide was derived. There are a number of
alternative programs which use peptide CID spectra for
protein identification, but we use the SEQUEST system
because it is currently the most highly automated
program and has proven to be successful, versatile and
robust.
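
The matching of a CID spectrum against candidate peptides of essentially the same mass can be sketched as follows. This is a simplified illustration, not the SEQUEST algorithm: a plain shared-peak count stands in for the scoring used by the actual program, only singly charged b- and y-ions are predicted, and the function names and tolerances are assumptions.

```python
# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def neutral_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def fragment_ions(peptide):
    """Predicted m/z values of singly charged b- and y-ions."""
    ions = []
    for i in range(1, len(peptide)):
        b_ion = sum(RESIDUE_MASS[aa] for aa in peptide[:i]) + PROTON
        y_ion = sum(RESIDUE_MASS[aa] for aa in peptide[i:]) + WATER + PROTON
        ions.extend([b_ion, y_ion])
    return ions


def shared_peak_count(observed_mz, peptide, tolerance=0.5):
    """Number of observed peaks explained by a predicted fragment ion."""
    predicted = fragment_ions(peptide)
    return sum(
        any(abs(obs - pred) <= tolerance for pred in predicted)
        for obs in observed_mz
    )


def best_match(observed_mz, precursor_mass, candidates, mass_tolerance=1.0):
    """Pick the candidate of matching precursor mass that explains most peaks.

    Returns None if no candidate lies within the (assumed) mass tolerance.
    """
    same_mass = [
        p for p in candidates
        if abs(neutral_mass(p) - precursor_mass) <= mass_tolerance
    ]
    if not same_mass:
        return None
    return max(same_mass, key=lambda p: shared_peak_count(observed_mz, p))
```

Because two peptides with the same composition but a different residue order share a precursor mass, it is the fragment ions that discriminate between them, which is the reason the CID step is needed at all.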



3.3 Protein identification by LC-MS/MS, capillary 
LC-MS/MS and CE-MS/MS 

It has been demonstrated repeatedly that MS has a very
high intrinsic sensitivity. For the routine analysis of
gel-separated proteins at high sensitivity, the most
significant challenge is the handling of small amounts of
sample. The crux of the problem is the extraction and
transferal of peptide mixtures generated by the digestion
of low nanogram amounts of protein, from gels into the
MS/MS system without significant loss of sample or
introduction of unwanted contaminants. We employ
three different systems for introducing gel-purified
samples into an MS, depending on the level of sensitivity
required. As an approximate guideline, for samples
containing tens of picomoles of peptides, LC-MS/MS is
most appropriate; for samples containing low picomole
amounts to high femtomole amounts we use capillary
LC-MS/MS; and for samples containing femtomoles or
less, CE-MS/MS is the method of choice.



3.3.1 LC-MS/MS

The coupling of an MS to an HPLC system using a 
0.5 mm diameter or bigger reverse phase (RP) column 
has been described in detail [42]. This system has several 
advantages if a large number of samples are to be ana- 
lyzed and all are available in sufficient quantity. The 
LC-MS and database searching program can be run in a 
fully automated mode using an autosampler, thus maxi- 
mizing sample throughput and minimizing the need for 
operator interference. The relatively large column is 
tolerant of high levels of impurities from either gel
preparation or sample matrix. Lastly, if configured with a
flow-splitter and micro-sprayer [40], analyses can be
performed on a small fraction of the sample (less than 5%)
while the remainder of the sample is recovered in very
pure solvents. This latter feature is particularly useful
when an orthogonal technique is also used to analyze
peptide fractions, such as scintillation counting of an introduced
radiolabel, and these data can be correlated with peptides
identified by CID spectra.

3.3.2 Capillary LC-MS

An increase of sensitivity of approximately tenfold can be
achieved by using a capillary LC system with a 100 µm ID
column rather than a 0.5 mm ID column as referred to
above. Since very low flow rates are required for such
columns, most reports have used a precolumn flow
splitting system for producing solvent gradients. We have
recently described the design and construction of a novel
gradient mixing system which enables the formation
of reproducible gradients at very low flow rates (low
nL/min) without the need for flow splitting (A. Ducret
et al., submitted for publication). Using this capillary
LC-MS/MS system we were able to identify gel-separat- 
ed proteins if low picomole to high femtomole amounts 
were loaded onto the gel [40]. This system is as yet not 
automated and, like all capillary LC systems, is prone to 
blockage of the columns by microparticulates when ana- 
lyzing gel-separated proteins. 

3.3.3 CE-MS/MS

The highest level of sensitivity for analyzing gel-sep- 
arated proteins can be achieved by using capillary elec- 
trophoresis - mass spectrometry (CE-MS). We have de- 
scribed in the past a solid-phase extraction capillary elec- 
trophoresis (SPE-CE) system which was used with triple 
quadrupole and ion trap ESI-MS/MS systems for the 
identification of proteins at the low femtomole to sub- 
femtomole sensitivity level [43, 44]. While this system is 
highly sensitive, its operation is labor-intensive and has
not been automated. In order to devise an
analytical system with both the sensitivity of CE and
the level of automation of LC, we have constructed
microfabricated devices for the introduction of samples
into ESI-MS for high-sensitivity peptide analysis.

Figure 3. Schematic illustration of a
microfabricated analytical system for CE,
consisting of a micromachined device,
coated capillary electroosmotic pump,
and microelectrospray interface. The
dimensions of the channels and reservoir
are as indicated in the text. The channels
on the device were graphically enhanced
to make them more visible. Reproduced
from [45], with permission.

The basic device is a piece of glass into which channels
of 10-30 µm in depth and 50-70 µm in diameter are
etched by using photolithography/etching techniques
similar to the ones used in the semiconductor industry.
(A simple device is shown in Fig. 3.) The channels are
connected to an external high voltage power supply [45].
Samples are manipulated on the device and off the
device to the MS by applying different potentials to the
reservoirs. This creates a solvent flow by electroosmotic
pumping which can be redirected by changing the
position of the electrode. Therefore, without the need for
valves or gates and without any external pumping, the 
flow can be redirected by simply switching the position 
of the electrodes on the device. The direction and rate of 
the flow can be modulated by the size and the polarity 
of the electric field applied and also by the charge state 
of the surface. 

The type of data generated by the system is illustrated in
Fig. 4, which shows the mass spectrum of a peptide sample
representing the tryptic digest of carbonic anhydrase at
290 fmol/µL. Each numbered peak indicates a peptide
successfully identified as being derived from carbonic
anhydrase. Some of the unassigned signals may be chemical
or peptide contaminants. The MS is programmed to
automatically select each peak and subject the peptide to CID.
The resulting CID spectra are then used to identify the
protein by correlation with sequence databases. Therefore,
this system allows us to concurrently apply a number of
protein digests onto the device, to sequentially mobilize
the samples, to automatically generate CID spectra of
selected peptide ions and to search sequence databases
for protein identification. These steps are performed
automatically without the need for user input and proteins can
be identified at very low femtomole level sensitivity at a
rate of approximately one protein per 15 min.

3.4 Assessment of 2-DE-MS proteome technology

Figure 4. MS spectrum of a tryptic digest
of carbonic anhydrase using the
microfabricated system shown in Fig. 3. 290
fmol/µL of carbonic anhydrase tryptic
digest was infused into a Finnigan LCQ
ion trap MS. Each peak was selected for
CID, and those which were identified as
containing peptides derived from
carbonic anhydrase are numbered.
Reproduced from [45], with permission.

Figure 5. 2-DE separation of a lysate of yeast cells, with identified proteins highlighted. The first dimension of separation was an IPG from
pH and the second dimension was a 10%T SDS-PAGE gel. Proteins were visualized by silver staining. Further details of experimental
procedures are included in S. P. Gygi et al. (submitted).

Using a combination of the analytical techniques
described above we have identified the 80 protein spots
indicated in Fig. 5. The protein pattern was generated by
separating a total of 40 micrograms of protein contained
in a total cell lysate of the yeast strain YPH499 by high
resolution 2-DE and silver staining of the separated
proteins. To estimate how far this type of proteome analysis
can penetrate towards the identification of low
abundance proteins, we have calculated the codon bias of the
genes encoding the respective proteins. Codon bias is a
calculated measure of the degree of redundancy of
triplet DNA codons used to produce each amino acid in a
particular gene sequence. It has been shown to be a
useful indicator of the level of the protein product of a
particular gene sequence present in a cell [46]. The
general rule which applies is that the higher the value of the
codon bias calculated for a gene, the more abundant the
protein product of that gene. The calculated
codon bias values corresponding to the proteins
identified in Fig. 5 are shown in Fig. 6b. Nearly all of the
proteins identified (> 95%) have codon bias values of > 0.2,
indicating they are highly abundant in cells. In contrast,
codon bias values calculated for the entire yeast genome
(Fig. 6a) show that the majority of proteins present in
the proteome have a codon bias of < 0.2 and are thus of
low abundance.
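
The comparison made in this paragraph amounts to asking what fraction of each set of genes lies above a codon bias of 0.2. A minimal sketch of that summary is given below; it assumes the codon bias values have already been computed by whatever measure is used in [46], and the function name and any input numbers are hypothetical rather than data from this study.

```python
def fraction_above(codon_bias_values, threshold=0.2):
    """Fraction of genes whose codon bias exceeds the threshold.

    The 0.2 default mirrors the cutoff quoted in the text; the input
    is any iterable of precomputed codon bias values.
    """
    values = list(codon_bias_values)
    if not values:
        raise ValueError("no codon bias values supplied")
    return sum(value > threshold for value in values) / len(values)
```

Applied once to the identified spots and once to all predicted genes, the two fractions express the sampling bias directly: the closer the first is to 1 while the second stays below 0.5, the more strongly the identified set is skewed towards abundant proteins.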

This finding is of considerable importance in our
assessment of the current status of proteome analysis
technology. It is clear that even using highly sensitive analytical
techniques, we are only able to visualize and identify the
more abundant proteins. Since many important
regulatory proteins are present only at low abundance, these
would not be amenable to analysis using such
techniques. This situation would be exacerbated in the
analysis of proteomes containing many more proteins than
the approximately 6000 gene products present in yeast
cells [16]. In the analysis of, for example, the proteome
of any human cells, there are potentially 50 000-100 000
gene products [47]. Inherent limitations on the amount
of protein that can be loaded on 2-DE, and the number
of components that can be resolved, indicate that only
the most highly abundant fraction of the many gene
products could be successfully analyzed. One approach
that has been employed to circumvent these limitations
is the use of very narrow range immobilized pH gradient
strips for the first-dimension separation of 2-DE [48].
Since only those proteins which focus within the narrow
range will enter the second dimension of separation, a
much higher sample loading within the desired range is
possible. This, in turn, can lead to the visualization and
identification of less abundant proteins.

Figure 6. Calculated codon bias values for yeast proteins. (A)
Distribution of calculated values for the entire yeast proteome. (B)
Distribution of calculated values for the subset of 80 identified proteins also
shown in Figs. 1 and 5. Further details of experimental procedures are
included in S. P. Gygi et al. (submitted).



4 Utility of proteome analysis for biological 
research 

For the success of proteomics as a mainstream approach 
to the analysis of biological systems it is essential to 
define how proteome analysis and biological research 
projects intersect. Without a clear plan for the
implementation of proteome-type approaches into biological
research projects the full impact of the technology cannot
be realized. The literature indicates that proteome
analysis is used both as a database/data archive, and as a
biological assay or biological research tool.

4.1 The proteome as a database 

The use of proteomics as a database or data archive 
essentially entails an attempt to identify all the proteins 
in a cell or species and to annotate each protein with the 
known biological information that is relevant for each 
protein. The level of annotation can, of course, be
extensive. The most common implementation of this idea is
the separation of proteins by high resolution 2-DE, the
identification of each detected protein spot and the
annotation of the protein spots in a 2-DE gel database
format. This approach is complicated by the fact that it is
difficult to precisely define a proteome and to decide
which proteome should be represented in the database.
In contrast to the genome of a species, which is
essentially static, the proteome is highly dynamic. Processes
such as differentiation, cell activation and disease can all
significantly change the proteome of a species. This is
illustrated in Fig. 7. The figure shows two
high-resolution 2-DE maps of proteins isolated from rat serum.
Fig. 7A is from the serum of normal rats, while Fig. 7B
is acute-phase serum from rats after
prior treatment with an inflammation-causing agent [49].
It is obvious that the protein patterns are significantly
different in several areas, raising the question of exactly
which proteome is being described.

Therefore, a comprehensive proteome database of a spe- 
cies or cell type needs to contain all of the parameters 
which describe the state and the type of the cells from 
which the proteins were extracted as well as the software 
tools to search the database with queries which reflect 
the dynamics of biological systems. A comprehensive 
proteome database should be capable of quantitatively 
describing the fate of each protein if specific systems
and pathways are activated in the cell. Specifically, the
quantity, the degree of modification, the subcellular
location and the nature of molecules specifically interacting
with a protein as well as the rate of change of these
variables should be described. Using these admittedly
stringent criteria, there is currently no complete proteome
database. A number of such databases are, however, in
the process of being constructed. The most advanced
among them, in our opinion, are the yeast protein
database YPD [50] (accessible at http://www.ypd.com) and
the human 2D-PAGE databases of the Danish Centre
for Human Genome Research [12] (accessible at
http://biobase.dk/cgi-bin/celis). While neither can be
considered complete as not all of the potential gene
products are identified, both contain extensive annotation
of supplemental information for many of the spots 
which are positively identified in reference samples. 

4.2 The proteome as a biological assay

The use of proteome analysis as a biological assay or 
research tool represents an alternative approach to
integrating biology with proteomics. To investigate the state
of a system, samples are subjected to a specific process
that allows the quantitative or qualitative measurement
of some of the variables which describe the system. In
typical biochemical assays one variable (e.g., enzyme
activity) of a single component (e.g., a particular
enzyme) is measured. Using proteomics as an assay,
multiple variables (e.g., expression level, rate of synthesis,
phosphorylation state, etc.) are measured concurrently
on many (ideally all) of the proteins in a sample. The
use of proteomics as an assay is a less far-reaching
proposition than the construction of a comprehensive
proteome database. It does, however, represent a pragmatic
approach which can be adapted to investigate specific 
systems and pathways, as long as the interpretation of 
the results takes into account that with current technol- 
ogy not all of the variables which describe the system 
can be observed (see Section 3.4). 

A common implementation of proteome analysis as a
biological assay is when a 2-DE protein pattern
generated from the analysis of an experimental sample is
compared to an array of reference patterns representing
different states of the system under investigation. The
state of the experimental system at the time the sample
was generated is therefore determined by the
quantitative comparative analysis of hundreds to a few thousand
proteins. Comparative analysis of the 2-DE patterns
furthermore highlights quantitative and qualitative
differences in the protein profiles which correlate with the
state of the system. For this type of analysis it is not
essential that all the proteins are identified or even
visualized, although the results become more informative as
more proteins are compared. It is obvious, however, that
the possibility to identify any protein deemed
characteristic for a particular state dramatically enhances this
approach by opening up new avenues for
experimentation.

Figure 7. High resolution 2-DE map of proteins isolated from rat serum with or without prior exposure to an
inflammation-causing agent. (A) normal rat serum, (B) acute-phase serum from rats which had previously been exposed to
an inflammation-causing agent. The first dimension of separation is an IPG from pH 4-10, and the second
dimension is a 7.5-17.5%T gradient SDS-PAGE gel. Proteins were visualized by staining with amido black. Further details
of experimental procedures are included in [14, 49].

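
The comparison of an experimental 2-DE pattern with an array of reference patterns can be sketched as follows. The sketch assumes that spots have already been detected, quantified and matched between gels, so that each pattern is a mapping from a spot identifier to an intensity; the function names, the root-mean-square distance and the twofold cutoff are illustrative assumptions, not a method described in the text.

```python
import math


def log2_ratios(sample, reference, floor=1e-6):
    """Log2 intensity ratio for every spot present in both patterns."""
    shared = sample.keys() & reference.keys()
    return {
        spot: math.log2(max(sample[spot], floor) / max(reference[spot], floor))
        for spot in shared
    }


def closest_state(sample, references):
    """Name of the reference pattern most similar to the sample.

    Similarity is the root-mean-square log2 ratio over shared spots,
    an assumed metric chosen for simplicity.
    """
    def distance(name):
        ratios = log2_ratios(sample, references[name])
        if not ratios:
            return math.inf
        return math.sqrt(sum(r * r for r in ratios.values()) / len(ratios))

    return min(references, key=distance)


def changed_spots(sample, reference, min_fold=2.0):
    """Spots whose intensity differs by at least `min_fold` in either direction."""
    cutoff = math.log2(min_fold)
    ratios = log2_ratios(sample, reference)
    return sorted(spot for spot, r in ratios.items() if abs(r) >= cutoff)
```

Spots returned by `changed_spots` are the candidates for identification by the MS methods of Section 3.3. Only spots present in both patterns are compared, which reflects the limitation that undetected proteins are invisible to the assay.
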
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Proteome analysis as a biological assay has been success- 
fully used in the field of toxicology, to characterize 
disease states or to study differential activation of cells. 
The approach is limited, of course, by the fact that only 
the visible protein spots are included in the assay, and it
is well known that a substantial but far from complete
fraction of cellular proteins is detected if a total cell
lysate is separated by 2-DE. Proteins may not be
detected in 2-DE gels because they are not abundant
enough to be visualized by the detection method used,
because they do not migrate within the boundaries (size,
pI) resolved by the gel, because they are not soluble
under the conditions used, or for other reasons. 

A different way to use proteome analysis as a biological 
assay to define the state of a biological system is to take 
advantage of the wealth of information contained in 
2-DE protein patterns. 2-DE is referred to as
two-dimensional because of the electrophoretic mobility and the
isoelectric points which define the position of each pro- 
tein in a 2-DE pattern. In addition to the two dimen- 
sions used to generate the protein patterns, a number of 
additional data dimensions are contained in the protein 
patterns. Some of these dimensions such as protein 
expression level, phosphorylation state, subcellular loca- 
tion, association with other proteins, rate of synthesis or 
degradation indicate the activity state of a protein or a 
biological system. Comparative analysis of 2-DE protein
patterns representing different states is therefore ideally
suited for the detection, identification and analysis of
suitable markers. Once again it must be emphasized that
in this type of experiment only a fraction of the cellular
proteins is analyzed. Since many regulatory proteins are
of low abundance, this limitation is a concern, particu- 
larly in cases in which regulatory pathways are being 
investigated. 

5 Concluding remarks 

In this report we have addressed three main issues 
related to proteome analysis. First, we have discussed 
the rationale for studying proteomes. Second, we have 
assessed the technical feasibility of analyzing proteomes 
and described current proteome technology, and third, 
we have analyzed the utility of proteome analysis for bio- 
logical research. It is apparent that proteome analysis is 
an essential tool in the analysis of biological systems. 
The multi-level control of protein synthesis and degrada- 
tion in cells means that only the direct analysis of 
mature protein products can reveal their correct
identities, their relevant state of modification and/or
association and their amounts. Recently developed methods
have enabled the identification of proteins at ever- 
increasing sensitivity levels and at a high level of auto- 
mation of the analytical processes. A number of tech- 
nical challenges, however, remain. While it is currently 
possible to identify essentially any protein spots that can 
be visualized by common staining methods, it is ap- 
parent that without prior enrichment only a relatively 
small and highly selected population of long-lived, 
highly expressed proteins is observed. There are many
more proteins in a given cell which are not visualized by 
such methods. Frequently it is the low abundance pro- 
teins that execute key regulatory functions. 



We have outlined the two principal ways proteome
analysis is currently being used to intersect with biological
research projects: the proteome as a database or data
archive and proteome analysis as a biological assay. Both 
approaches have in common that at present they are con- 
ceptually and technically limited. Current proteome data- 
bases typically are limited to one cell type and one state 
of a cell and therefore do not account for the dynamics 
of biological systems. The use of proteome analysis as a 
biological assay can provide a wealth of information, but 
it is limited to the proteins detected and is therefore not
truly proteome-wide. These limitations in proteomics are
to a large extent a reflection of the fact that proteins in
their fully processed form cannot easily be amplified and
are therefore difficult to isolate in amounts sufficient for
analysis or experimentation. The fact that to date no
complete proteome has been described further attests to
these difficulties. With continued rapid progress in
protein analysis technology, however, we anticipate that the
goal of complete proteome analysis will eventually 
become attainable. 

We would like to acknowledge the funding for our work 
from the National Science Foundation Science and Technol- 
ogy Center for Molecular Biotechnology and from the NIH. 
We thank Yvan Rochon and Bob Franza for providing the 
yeast gel shown and Elisabetta Gianazza for providing the
rat serum gels shown. 

Received April 21, 1998 



6 References 

[1] Wilkins, M. R., Pasquali, C., Appel, R. D., Ou, K., Golaz, O.,
Sanchez, J.-C., Yan, J. X., Gooley, A. A., Hughes, G.,
Humphery-Smith, I., Williams, K. L., Hochstrasser, D. F., Bio/Technology
1996, 14, 61-65.

[2] Hodges, P. E., Payne, W. E., Garrels, J. I., Nucleic Acids Res. 1998,
26, 68-72.

[3] O'Connor, C. D., Farris, M., Fowler, R., Qi, S. Y., Electrophoresis
1997, 18, 1483-1490.

[4] Cordwell, S. J., Basseal, D. J., Humphery-Smith, I.,
Electrophoresis 1997, 18, 1335-1346.

[5] Urquhart, B. L., Atsalos, T. E., Roach, D., Basseal, D. J., Bjellqvist,
B., Britton, W. L., Humphery-Smith, I., Electrophoresis 1997, 18,
1384-1392.

[6] Wasinger, V. C., Bjellqvist, B., Humphery-Smith, I.,
Electrophoresis 1997, 18, 1373-1383.

[7] Link, A. J., Hays, L. G., Carmack, E. B., Yates III, J. R.,
Electrophoresis 1997, 18, 1314-1334.

[8] Sazuka, T., Ohara, O., Electrophoresis 1997, 18, 1252-1258.

[9] VanBogelen, R. A., Abshire, K. Z., Moldover, B., Olson, E. R.,
Neidhardt, F. C., Electrophoresis 1997, 18, 1243-1251.

[10] Guerreiro, N., Redmond, J. W., Rolfe, B. G., Djordjevic, M. A.,
Mol. Plant Microbe Interact. 1997, 10, 506-516.

[11] Yan, J. X., Tonella, L., Sanchez, J.-C., Wilkins, M. R., Packer,
N. H., Gooley, A. A., Hochstrasser, D. F., Williams, K. L.,
Electrophoresis 1997, 18, 491-497.

[12] Celis, J., Gromov, P., Ostergaard, M., Madsen, P., Honoré, B.,
Dejgaard, K., Olsen, E., Vorum, H., Kristensen, D. B., Gromova, I.,
Haunso, A., Van Damme, J., Puype, M., Vandekerckhove, J.,
Rasmussen, H. H., FEBS Lett. 1996, 398, 129-134.

[13] Appel, R. D., Sanchez, J.-C., Bairoch, A., Golaz, O., Miu, M.,
Vargas, J. R., Hochstrasser, D. F., Electrophoresis 1993, 14,
1232-1238.

[14] Haynes, P., Miller, I., Aebersold, R., Gemeiner, M., Eberini, I.,
Lovati, R. M., Manzoni, C., Vignati, M., Gianazza, E.,
Electrophoresis 1998, 19, 1484-1492.



Electrophoresis 1998, 19, 1862-1871

Proteome analysis: Biological assay or data archive 1871




[15] Fleischmann, R. D., Adams, M. D., White, O., Clayton, R. A.,
Kirkness, E. F., Kerlavage, A. R., Bult, C. J., Tomb, J.-F.,
Dougherty, B. A., Merrick, J. M., McKenney, K., Sutton, G., FitzHugh,
W., Fields, C., Gocayne, J. D., Scott, J., Shirley, R., Liu, L. I.,
Glodek, A., Kelley, J. M., Weidman, J. F., Phillips, C. A., Spriggs,
T., Hedblom, E., Cotton, M. D., Utterback, T. R., Hanna, M. C.,
Nguyen, D. T., Saudek, D. M., Brandon, R. C., Fine, L. D.,
Fritchman, J. L., Fuhrmann, J. L., Geoghagen, N. S. M., Gnehm,
C. L., McDonald, L. A., Small, K. V., Fraser, C. M., Smith, H. O.,
Venter, J. C., Science 1995, 269, 496-512.

[16] Goffeau, A., Barrell, B. G., Bussey, H., Davis, R. W., Dujon, B.,
Feldmann, H., Galibert, F., Hoheisel, J. D., Jacq, C., Johnston, M.,
Louis, E. J., Mewes, H. W., Murakami, Y., Philippsen, P., Tettelin,
H., Oliver, S. G., Science 1996, 274, 546.

[17] Fraser, C. M., Casjens, S., Huang, W. M., Sutton, G. G., Clayton,
R., Lathigra, R., White, O., Ketchum, K. A., Dodson, R., Hickey,
E. K., Gwinn, M., Dougherty, B., Tomb, J. F., Fleischmann, R. D.,
Richardson, D., Peterson, J., Kerlavage, A. R., Quackenbush, J.,
Salzberg, S., Hanson, M., van Vugt, R., Palmer, N., Adams, M. D.,
Gocayne, J., Weidman, J., Utterback, T., Watthey, T., McDonald,
L., Artiach, P., Bowman, C., Garland, S., Fujii, C., Cotton, M. D.,
Horst, K., Roberts, K., Hatch, B., Smith, H., Venter, J. C.,
Nature 1997, 390, 580-586.

[18] Liang, P., Pardee, A. B., Science 1992, 257, 967-971.

[19] Lashkari, D. A., DeRisi, J. L., McCusker, J. H., Namath, A. F.,
Gentile, C., Hwang, S. Y., Brown, P. O., Davis, R. W., Proc. Natl.
Acad. Sci. USA 1997, 94, 13057-13062.

[20] Shalon, D., Smith, S. J., Brown, P. O., Genome Res. 1996, 6,
639-645.

[21] Velculescu, V. E., Zhang, L., Vogelstein, B., Kinzler, K. W.,
Science 1995, 270, 484-487.

[22] Velculescu, V. E., Zhang, L., Zhou, W., Vogelstein, J., Basrai,
M. A., Bassett, D. E., Hieter, P., Vogelstein, B., Kinzler, K. W.,
Cell 1997, 88, 243-251.
[23] Krishna, R. G., Wold, F., Adv. Enzymol. 1993, 67, 265-298.
[24] Görg, A., Postel, W., Günther, S., Electrophoresis 1988, 9, 531-546.
[25] Klose, J., Kobalz, U., Electrophoresis 1995, 16, 1034-1059.
[26] Matsudaira, P., J. Biol. Chem. 1987, 262, 10035-10038.
[27] Aebersold, R. H., Teplow, D. B., Hood, L. E., Kent, S. B., J. Biol.
Chem. 1986, 261, 4229-4238.
[28] Rosenfeld, J., Capdevielle, J., Guillemot, J. C., Ferrara, P., Anal.
Biochem. 1992, 203, 173-179.
[29] Aebersold, R. H., Leavitt, J., Saavedra, R. A., Hood, L. E., Kent,
S. B., Proc. Natl. Acad. Sci. USA 1987, 84, 6970-6974.
[30] Honoré, B., Leffers, H., Madsen, P., Celis, J. E., Eur. J. Biochem.
1993, 218, 421-430.
[31] Mann, M., Wilm, M., Anal. Chem. 1994, 66, 4390-4399.
[32] Eng, J., McCormack, A. L., Yates III, J. R., J. Amer. Mass
Spectrom. 1994, 5, 976-989.
[33] Yates III, J. R., Eng, J. K., McCormack, A. L., Schieltz, D., Anal.
Chem. 1995, 67, 1426-1436.
[34] Shevchenko, A., Wilm, M., Vorm, O., Mann, M., Anal. Chem.
1996, 68, 850-858.



[35] Hess, D., Covey, T. C., Winz, R., Brownsey, R. W., Aebersold, R.,
Protein Sci. 1993, 2, 1342-1351.

[36] van Oostveen, I., Ducret, A., Aebersold, R., Anal. Biochem. 1997,
247, 310-318.

[37] Lui, M., Tempst, P., Erdjument-Bromage, H., Anal. Biochem. 1996,
241, 156-166.

[38] Patterson, S. D., Aebersold, R. A., Electrophoresis 1995, 16,
1791-1814.

[39] Ducret, A., Foyn Bruun, C., Bures, E. J., Marhaug, G., Husby,
G., Aebersold, R., Electrophoresis 1996, 17, 866-876.

[40] Haynes, P. A., Fripp, N., Aebersold, R., Electrophoresis 1998, 19,
939-945.

[41] Figeys, D., Van Oostveen, I., Ducret, A., Aebersold, R., Anal.
Chem. 1996, 68, 1822-1828.

[42] Ducret, A., Van Oostveen, I., Eng, J. K., Yates III, J. R.,
Aebersold, R., Protein Sci. 1997, 7, 706-719.

[43] Figeys, D., Ducret, A., Yates III, J. R., Aebersold, R., Nature
Biotech. 1996, 14, 1579-1583.

[44] Figeys, D., Aebersold, R., Electrophoresis 1997, 18, 360-368.

[45] Figeys, D., Ning, Y., Aebersold, R., Anal. Chem. 1997, 69,
3153-3160.

[46] Garrels, J. I., McLaughlin, C. S., Warner, J. R., Futcher, B., Latter,
G. I., Kobayashi, R., Schwender, B., Volpe, T., Anderson, D. S.,
Mesquita-Fuentes, R., Payne, W. E., Electrophoresis 1997, 19,
1347-1360.

[47] Schuler, G. D., Boguski, M. S., Stewart, E. A., Stein, L. D.,
Gyapay, G., Rice, K., White, R. E., Rodriguez-Tome, P., Aggarwal,
A., Bajorek, E., Bentolila, S., Birren, B. B., Butler, A., Castle,
A. B., Chiannilkulchai, N., Chu, A., Clee, C., Cowles, S., Day,
P. J., Dibling, T., Drouot, N., Dunham, I., Duprat, S., Edwards, C.,
Fan, J. B., Fang, N., Fizames, C., Garrett, C., Green, L., Hadley,
D., Harris, M., Harrison, P., Brady, S., Hicks, A., Holloway, E.,
Hui, L., Hussain, S., Louis-Dit-Sully, C., Ma, J., MacGilvery, A.,
Mader, C., Maratukulam, A., Matise, T. C., McKusick, K. B.,
Morissette, J., Mungall, A., Muselet, D., Nusbaum, H. C., Page,
D. C., Peck, A., Perkins, S., Piercy, M., Qin, F., Quackenbush, J.,
Ranby, S., Reif, T., Rozen, S., Sanders, X., She, X., Silva, J.,
Slonim, D. K., Soderlund, C., Sun, W.-L., Tabar, P., Thangarajah,
T., Vega-Czarny, N., Vollrath, D., Voyticky, S., Wilmer, T., Wu, X.,
Adams, M. D., Auffray, C., Walter, N. A. R., Brandon, R., Dehejia,
A., Goodfellow, P. N., Houlgatte, R., Hudson, J. R., Jr., Ide, S. E.,
Iorio, K. R., Lee, W. Y., Seki, N., Nagase, T., Ishikawa, K.,
Nomura, N., Phillips, C., Polymeropoulos, M. H., Sandusky, M.,
Schmitt, K., Berry, R., Swanson, K., Torres, R., Venter, J. C.,
Sikela, J. M., Beckmann, J. S., Weissenbach, J., Myers, R. M., Cox,
D. R., James, M. R., Bentley, D., et al., Science 1996, 274, 540-546.

[48] Sanchez, J.-C., Rouge, V., Pisteur, M., Ravier, F., Tonella,
Moosmayer, M., Wilkins, M. R., Hochstrasser, D. F.,
Electrophoresis 1997, 18, 324-327.

[49] Miller, I., Haynes, P., Gemeiner, M., Aebersold, R., Manzoni, C.,
Lovati, M. R., Vignati, M., Eberini, I., Gianazza, E.,
Electrophoresis 1998, 19, 1493-1500.

[50] Garrels, J. I., Nucleic Acids Res. 1996, 24, 46-49.






EDITOR-IN-CHIEF 

William S. Hancock 

Barnett Institute and
Department of Chemistry 
Northeastern University 
360 Huntington Avenue 
341 Mugar Bldg.
Boston, MA 02115 
617-373-4881; Fax: 617-373-2855 
whancock@acs.org 

ASSOCIATE EDITORS 
Joshua LaBaer 

Harvard Medical School 
György Marko-Varga

AstraZeneca and Lund University 

EDITORIAL ADVISORY BOARD 

Ruedi H. Aebersold

Institute for Systems Biology 

Leigh Anderson 

Plasma Proteome Institute 

Ettore Appella 

National Cancer Institute 

Rolf Apweiler

European Bioinformatics Institute

Ronald Beavis 

Manitoba Centre for Proteomics 

Walter Blackstock 

Cellzome



The Rockefeller University 
Patrick L. Coleman

3M 

Christine Colvis 

National Institutes of Health 

Catherine Fenselau 

University of Maryland 

Daniel Figeys

MDS Proteomics 
Sam Hanash 

University of Michigan 

Stanley Hefta 

Bristol-Myers Squibb 
Donald F. Hunt 

University of Virginia 
Barry L. Karger

Northeastern University 

Daniel C. Liebler

Vanderbilt University School of Medicine 

Lance Liotta

National Cancer Institute 
Matthias Mann 
University of Southern Denmark 
Stephen A. Martin 

Applied Biosystems 



Imperial College of London 

Gilbert S. Omenn 

University of Michigan 

Emanuel Petricoin

Food and Drug Administration 

J. Michael Ramsey 

Oak Ridge National Laboratory 

Pier Giorgio Righetti

University of Verona 

John T. Stults 

Biospect 

Peter Wagner 

Zyomyx 

Keith Williams 

Proteome Systems 
Qi-Chang Xia

Shanghai Institute of Biochemistry 

John R. Yates, III 

The Scripps Research Institute 



Editorial

Do We Have Enough Biomarkers? 

The Editor has become aware of a recent push to validate currently available biomark- 
ers in an extensive clinical setting. The reasoning behind such a push is that there are 
already a significant number of biomarkers that now need to be used effectively in the 
clinic. Many biomarkers, such as the carcinoembryonic antigen, have been known for some
time and are used widely for patient management. The older biomarkers, however, are 
not effective for early diagnosis. 

With the advent of genomics and, later, proteomics, there has been a substantial 
investment in using these new tools to generate additional biomarkers. The problem 
with this new information is that it is too early to get consensus on what is a useful marker 
or what is a good patient population for such a study. Therefore, it is unclear whether 
the new markers currently in hand will give better clinical information than the ones 
that have been used in the past. An additional problem is that the markers that are gen- 
erated by proteomics are not always consistent with the markers that are generated from 
expression profiling. 

The challenge in this situation is to balance the need of patients for better, early diag- 
nosis of disease with the need to have high-quality markers for the expensive and time- 
consuming validation process. This Editor believes that proteomics is at too early a stage 
for this new technology to have generated a quality list of markers. The risk is that if we push
the existing markers into extensive clinical validation, we will be missing the fruits of 
improvements in emerging proteomics technology. I think many people in the proteomics 
community would agree that federal granting agencies should be enticed to continue 
investments in basic proteomics technology. In addition, funding should be made 
available for basic science studies that will continue to generate biomarkers, and there 
needs to be some type of consensus-building process that can lead to a consolidation 
of the different lists of biomarkers. 

There are good past models for such activities, such as the consensus-forming meet- 
ings that the U.S. Food and Drug Administration has held; these yielded technical inno- 
vations. One example was the generation of new protein pharmaceuticals at the advent 
of the biotechnology industry. Another example, in the early days of the genome sequencing
program, was when a group of experts came together to agree on annotation of the
early results. The Human Proteome Organization is a good example of an international 
group of laboratories coming together to consolidate the output from a number of 
studies with different technology platforms. 

I would like to encourage the biomedical community not to rush to judgment in 
terms of biomarkers, but instead to give research more time to produce quality biomarker 
information. Then we should conduct a thorough evaluation of a widely agreed-on list 
before we attempt to determine which of these new markers are indeed worthy of exten- 
sive clinical investigation. 




© 2004 American Chemical Society 
Analysis of Genomic and Proteomic Data Using Advanced Literature
High-throughput technologies, such as proteomic screening and DNA micro-arrays, produce vast
amounts of data requiring comprehensive analytical methods to decipher the biologically relevant 
results. One approach would be to manually search the biomedical literature; however, this would be 
an arduous task. We developed an automated literature-mining tool, termed MedGene, which 
comprehensively summarizes and estimates the relative strengths of all human gene-disease 
relationships in Medline. Using MedGene, we analyzed a novel micro-array expression dataset 
comparing breast cancer and normal breast tissue in the context of existing knowledge. We found no
correlation between the strength of the literature association and the magnitude of the difference in
expression level when considering changes as high as 5-fold; however, a significant correlation was
observed (r = 0.41; p = 0.05) among genes showing an expression difference of 10-fold or more.
Interestingly, this only held true for estrogen receptor (ER) positive tumors, not ER negative. MedGene 
identified a set of relatively understudied, yet highly expressed genes in ER negative tumors worthy of 
further examination. 

Keywords: bioinformatics • micro-array • text mining • gene-disease association • breast cancer



Introduction 

At its current pace, the accumulation of biomedical literature 
outpaces the ability of most researchers and clinicians to stay 
abreast of their own immediate fields, let alone cover a broader 
range of topics. For example, to follow a single disease, e.g., 
breast cancer, a researcher would have had to scan 130 different 
journals and read 27 papers per day in 1999. 1 This problem is 
accentuated with high-throughput technologies such as DNA 
micro-arrays and proteomics, which require the analysis of
large datasets involving thousands of genes, many of which are 
unfamiliar to a particular researcher. In any microarray experi- 
ment, thousands of genes may demonstrate statistically sig- 
nificant expression changes, but only a fraction of these may 
be relevant to the study. The ability to interpret these datasets 
would be enhanced if they could be compared to a comprehensive
summary of what is known about all genes. Thus, there
is a need to summarize existing knowledge in a format that 
allows for the rapid analysis of associations between genes and 
diseases or other specific biological concepts. 

One solution to this problem is to compile structured digital 
resources, such as the Breast Cancer Gene Database 1 and the 
Tumor Gene Database. 2 However, as these resources are hand- 
curated, the labor-intensive review process becomes a rate- 
limiting step in the growth of the database. As a result, these 

* To whom correspondence should be addressed: jlabaer@hms.harvard.edu.
databases have a limited scale and the genes are not selected
in a systematic fashion. 

An alternative approach is automated text mining, a method
that involves automated information extraction by searching
documents for text strings and analyzing their frequency and 
context. This approach has been used successfully in several 
instances for biological applications. In most cases, it has been 
applied to extract information about the relationships or 
interactions that proteins or genes have with one another, in 
the literature or by functional annotation. 3-7 Thus far, few
publications have applied text-mining to examine the global
relationships between genes and diseases. Perez-Iratxeta et al. 
automatically examined the GO (Gene Ontology) annotation 
of genes and their predicted chromosomal locations in order 
to identify genes linked to inherited disorders. 8 

To obtain a more global understanding of disease develop- 
ment, it would be valuable to incorporate information regarding 
all possible gene-disease relationships, including biochemical, 
physiological, pharmacological, epidemiological, as well as 
genetic. This information would enable comprehensive com- 
parisons between large experimental datasets and existing 
knowledge in the literature. This would accomplish two things.
First, it would serve to validate experiments by demonstrating 
that known responses occur as predicted. Second, it would 
rapidly highlight which genes are corroborated by the literature 
and which genes are novel in a given context. We have utilized 
a computational approach to literature mining to produce a 
comprehensive set of gene-disease relationships. In addition,
we have developed a novel approach to assess the strength of 
each association based on the frequency of citation and
co-citation. We applied this tool to help interpret the data from a
large micro-array gene expression experiment comparing 
normal and cancerous breast tissue. 

Methods 

MedGene Database. MedGene is a relational database, stor- 
ing disease and gene information from NCBI, text mining re- 
sults, statistical scores, and hyperlinks to the primary lit- 
erature. MedGene has a web-based user interface for users to 
query the database (http://hipseq.med.harvard.edu/MedGene/). 

Text Mining Algorithms. MeSH files were downloaded from 
the MeSH web site at NLM (National Library of Medicine)
(http://www.nlm.nih.gov/mesh/meshhome.html) and human disease
categories were selected. LocusLink files were downloaded from 
the LocusLink web site at NCBI (http://www.ncbi.nih.gov/LocusLink/).
Official/preferred gene symbol, official/preferred
gene name, and gene alternative symbols and names, all 
relevant annotations and URLs for each LocusLink record, were 
collected. Gene search terms were used for literature searching 
and included all qualified gene names, gene symbols, and gene 
family terms. Primary gene keys, predominantly qualified gene 
family terms and gene official/preferred symbols, were used 
to index Medline records. If the official/preferred gene symbols 
did not meet the standards to be an index, then qualified gene 
official/preferred names were used. A local copy of Medline 
records (up to July, 2002) was pre-selected. 

A JAVA module examined the MeSH terms and then indexed 
each Medline record with the appropriate disease terms. A 
separate JAVA module was used to examine the titles and 
abstracts for gene search terms and then to index the gene- 
related Medline records with the relevant primary gene key(s). 

Statistical Methods. For every gene and disease pair, we 
counted records that were indexed for both gene and disease 
(double positive hits), for disease only (disease single hits), for 
gene only (gene single hits), and for neither gene nor disease 
(double negative hits) to generate a 2 x 2 contingency table. 
On the basis of the contingency table-framework, we applied 
different statistical methods to estimate the strength of gene- 
disease relationships and evaluated the results. These methods 
included chi-square analysis, Fisher's exact probabilities, relative
risk of gene, and relative risk of disease 16
(http://hipseq.med.harvard.edu/MedGene/). In addition, we computed
the "product of frequency", which is the product of the 
proportion of disease/gene double hits to disease single hits 
and the proportion of disease/gene double hits to gene single 
hits. To obtain a normal distribution, we transformed all the 
statistical scores using the natural logarithm. We selected the
log of the product of frequency (LPF) to validate MedGene and 
to use for the analysis with the micro-array data. Spearman 
rank-correlation coefficients were used to assess the linear 
relationship between LPF and micro-array fold change in 
expression level. 

Global Analysis. Diseases with at least 50 related genes were 
selected for clustering analysis, and the LPF scores were 
normalized with total score for each disease. Hierarchical 
clustering was done with the "Cluster" software and the 
clustering result was visualized using "TreeViewer"
(http://rana.lbl.gov/EisenSoftware.htm).
Breast Tissue Micro-Arrays. Eighty-nine breast cancer
samples (79% ER-positive) and 7 normal breast tissue samples 
were selected from the Harvard Breast SPORE frozen tissue 
repository and were representative of the spectrum of histo- 
logical types, grades, and hormone receptor immuno-pheno- 
types of breast cancer. Biotinylated cRNA, generated from the 
total RNA extracted from the bulk tumor, was hybridized to 
Affymetrix U95A oligo-nucleotide micro-arrays. These micro- 
arrays consist of 12 400 probes, which represent approximately 
9000 genes. Raw expression values were obtained using GENE- 
CHIP software from Affymetrix, and then further analyzed using 
the DNA-Chip Analyzer (dChip) custom software.

Results 

Automated Indexing of Medline Records by Disease and 
Gene. To study the gene-disease associations in the literature, 
we first compiled complete lists for human diseases and human 
genes. To index all Medline records that were relevant to 
human diseases, the Medical Subject Heading (MeSH) index 
of Medline records was utilized. MeSH is a controlled medical 
vocabulary from the National Library of Medicine and consists 
of a set of terms or subject headings that are arranged in both 
an alphabetic and an hierarchical structure. Medline records 
are reviewed manually and MeSH terms are added to each with 
software assistance. 9,10 Twenty-three human disease category
headings along with all of their child terms (see the Supporting 
Information, Supplemental Table 1, or visit
http://hipseq.med.harvard.edu/MedGene/publication/s_Table 1.html) were
selected from the 2002 MeSH index creating a list of 4033 
human diseases. 
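
As a rough illustration of how a disease list can be built from category headings plus all of their child terms, the sketch below walks a parent-to-children map. The three-level fragment is a toy example, and the function names and data layout are assumptions rather than the modules used in this study.

```python
def collect_terms(tree, heading):
    """Return `heading` followed by all of its descendants.

    `tree` maps a heading to the list of its direct child headings.
    """
    terms = [heading]
    for child in tree.get(heading, []):
        terms.extend(collect_terms(tree, child))
    return terms

def index_by_disease(mesh_terms, disease_terms):
    """Keep only the MeSH terms of one record that are on the disease list."""
    return set(mesh_terms) & set(disease_terms)

# Toy fragment of a hierarchy, for illustration only.
toy_tree = {
    "Neoplasms": ["Neoplasms by Site"],
    "Neoplasms by Site": ["Breast Neoplasms", "Prostatic Neoplasms"],
}
disease_list = collect_terms(toy_tree, "Neoplasms")
record_index = index_by_disease(["Humans", "Breast Neoplasms"], disease_list)
```

Starting from a category heading rather than a flat list means that a newly added child term is picked up automatically the next time the hierarchy is traversed.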

No index comparable to the MeSH index exists for genes, 
and thus, it was necessary to apply a string search algorithm 
for gene names or symbols found in Medline text. A complete
list of genes, gene names, gene symbols, and frequently used 
synonyms were collected from the LocusLink database at 
NCBI, 12 which contains 53 259 independent records keyed 
by an official gene symbol or name (June 18th, 2002). For the
purposes of this study, no distinction was made between genes 
and their gene products. Authors often use the same name for 
both, differentiating the two only by the use of italics, if at all.
For the intended use of this study, this lack of distinction is 
unlikely to have a large effect and may in fact be beneficial. 

Initial attempts to search the literature using these lists 
revealed several sources of false positives and false negatives 
(Table 1). False positives primarily arose when the searched 
term had other meanings, whereas false negatives arose from 
syntax discrepancies necessitating the development of filters 
to reduce these errors. The syntax issues were readily handled 
by including alternate syntax forms in the search terms. The
false positive cases, caused by duplicative and unrelated 
meanings for the terms, were more difficult to manage. Where 
possible, case sensitive string mapping reduced inappropriate 
citations. In many cases, however, this was not sufficient and 
the terms had to be eliminated entirely, thereby reducing the 
false positive rate but unavoidably under-representing some 
genes. 

For the purposes of data tracking, a primary gene key was
selected to represent all synonyms that correspond to each 
gene. Medline records were indexed with a primary gene key 
when any synonym for that key was found in the title or 
abstract. Case-insensitive string mapping was used for all 
searches except as noted above. No additional weight was 
Table 1. Systematic Sources of False Positives and False Negatives in Unfiltered Data*

source of error               error type      example                                   filter solution

gene symbol/name              false positive  MAG: myelin associated glycoprotein;      eliminate this term
is not unique                                 MAG: malignancy-associated protein
gene symbol is                false positive  PA: pallid homolog (mouse), pallidin      eliminate this term
unrelated abbreviation                        (also abbrev. for Pennsylvania)
gene symbol/name              false positive  WAS: Wiskott-Aldrich Syndrome             case-sensitive string search
has language meaning                          (also the word "was")
nonstandard syntax            false negative  BAG-1 instead of BAG1                     add dash term
unofficial gene name/symbol   false negative  P53 instead of TP53                       add all gene nicknames
nonspecified gene name        false negative  estrogen receptor instead of              add family stem term
                                              Estrogen receptor 1

* In preliminary studies, Medline was searched for co-occurrence of genes and diseases and the resulting output was evaluated to identify error sources that
were amenable to global filters. Each error source is categorized by the type of error it causes: false positives are suggested relationships that are not real and
false negatives are real relationships that are underrepresented. The filter solutions used are indicated. Note that in some cases, the filter solution itself introduces
error. In general, the filters were chosen to maximize sensitivity, even at the expense of specificity if needed.



added for multiple occurrences of a term or the co-occurrence 
of multiple synonyms for the same gene key. 

Medline records were searched with all qualified gene 
identifiers, such as the official/preferred gene symbol, the 
official/preferred gene name, all gene nicknames and all syntax 
variants. In situations where there are several members of a 
gene family or splice variants, some authors prefer to use a 
shortened gene family name, e.g., estrogen receptor instead of 
estrogen receptor 1 (ESR1), creating a source of false negatives.
For this reason, gene family stem terms were created for all
genes that have an alpha or numerical suffix (e.g., IL2RA, TGFβ,
ESR1, etc.) and then used to search the literature. The family
stem terms were handled separately from the specific gene 
names so that it would be clear when linkages were made to 
the gene family versus a specific member in that family. 
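
The matching rules described above (synonyms collapsed onto a primary gene key, dash variants, case-sensitive handling of ambiguous symbols, and separately tracked family stem terms) might be sketched as follows. All function names, the suffix rule used to derive a stem, and the example abstract are assumptions made for illustration; the actual JAVA modules are not reproduced here.

```python
import re

def dash_variant(symbol):
    """Insert a dash before a trailing number, e.g. BAG1 -> BAG-1."""
    return re.sub(r"^([A-Za-z]+)(\d+)$", r"\1-\2", symbol)

def family_stem(name):
    """Drop a trailing number or single-letter suffix from a gene name,
    e.g. 'estrogen receptor 1' -> 'estrogen receptor' (simplified rule)."""
    return re.sub(r"\s+(?:\d+|[A-Za-z])$", "", name)

def index_record(text, search_terms, case_sensitive=frozenset()):
    """Return the primary gene keys whose search terms occur in `text`.

    `search_terms` maps a primary key to its list of search terms. Terms in
    `case_sensitive` must match exactly; all others ignore case. A key is
    recorded once per record, however many of its terms match.
    """
    hits = set()
    for key, terms in search_terms.items():
        for term in terms:
            flags = 0 if term in case_sensitive else re.IGNORECASE
            # Non-alphanumeric boundaries keep "WAS" from matching "WASP".
            pattern = r"(?<![A-Za-z0-9])" + re.escape(term) + r"(?![A-Za-z0-9])"
            if re.search(pattern, text, flags):
                hits.add(key)
                break
    return hits

search_terms = {
    "BAG1": ["BAG1", dash_variant("BAG1")],
    "WAS": ["WAS"],
    # The family stem gets its own key so family-level links stay distinct.
    "ESR (family stem)": [family_stem("estrogen receptor 1")],
}
abstract = "Expression of BAG-1 was altered in estrogen receptor positive tumors."
found = index_record(abstract, search_terms, case_sensitive={"WAS"})
found_unfiltered = index_record(abstract, search_terms)
```

In this toy abstract the case-sensitive rule is what keeps the ordinary word "was" from being indexed as the WAS gene, while the dash variant and the family stem recover two mentions that an exact symbol search would miss.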

To improve performance and accuracy, some pre-selection 
was applied to the records that were scanned. First, review
articles were eliminated to avoid redundant treatment of 
citations. Second, non-English journals were removed because 
the natural language filters were only relevant to English 
publications. Finally, journals unlikely to contain primary data 
about gene-disease relationships were also removed (e.g., Int.
J. Health Educ., Bedside Nurse, and J. Health Econ.). Together,
these filters reduced the 12 198 221 Medline publications (July 
2002) by 37%. 
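
A minimal sketch of the three pre-selection rules is shown below. The record fields (`publication_types`, `language`, `journal`) and the language code are hypothetical; only the three example journal names come from the text.

```python
# Journals judged unlikely to carry primary gene-disease data (examples from the text).
EXCLUDED_JOURNALS = {"Int. J. Health Educ.", "Bedside Nurse", "J. Health Econ."}

def keep_record(record):
    """Apply the three pre-selection rules to one record (a plain dict)."""
    if "Review" in record["publication_types"]:
        return False  # review articles would repeat citations
    if record["language"] != "eng":
        return False  # the natural language filters assume English text
    if record["journal"] in EXCLUDED_JOURNALS:
        return False
    return True

# Invented records, for illustration only.
toy_records = [
    {"publication_types": ["Journal Article"], "language": "eng", "journal": "Cancer Res."},
    {"publication_types": ["Review"], "language": "eng", "journal": "Cancer Res."},
    {"publication_types": ["Journal Article"], "language": "ger", "journal": "Cancer Res."},
    {"publication_types": ["Journal Article"], "language": "eng", "journal": "Bedside Nurse"},
]
kept = [r for r in toy_records if keep_record(r)]
```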

Ranking the Relative Strengths of Gene-Disease Associa- 
tions. In total, there were 618 708 gene-disease co-citations, 
in which 16% (8297) of all studied genes had been associated 
to a disease and 96% (3875) of all diseases had been associated 
to at least one gene. To rank the relative strengths of
gene-disease relationships, we tested several different statistical
methods and examined the results. With the exception of the
relative risk estimates, the methods provided similar results 
with respect to the rank order of the gene-disease association 
strengths. However, after comparing the results to other 
databases and after consulting disease experts, the log of the 
product of frequency (LPF) was selected for further analysis 
because it gave the best results overall. 

Validation of MedGene. In developing this tool, it was
important to minimize the number of missed genes (false 
negatives) and miscalled genes (false positives). However, in 
situations when these goals were in conflict, inclusiveness was
prioritized. To determine the false negative rate in MedGene, 
breast cancer was used as a test case because it was associated 
with more genes than any other human disease and because 




Figure 1. Estimation of the false negative rate by comparison 
with hand-curated databases. The breast cancer-related genes 
identified by MedGene were compared with those listed in 
several other databases including the Tumor Gene Database 
(TGD), 2 the Breast Cancer Gene Database (BCG), 1 GeneCards
(GC) 17 and Swissprot. 18 Genes were considered false negatives
if they were represented in at least one of these other databases 
and not in MedGene and their link to breast cancer was sup- 
ported by at least one literature reference. All literature references 
were verified by manual review to confirm their validity. The 
number of genes in each database or shared by more than one 
database is indicated. The false negative rate was calculated by 
genes missed at MedGene (26)/total number of nonoverlapping 
genes in other databases (285). 

there were several public databases that link genes to breast 
cancer. We compared the list of breast cancer-related genes 
from MedGene to these databases, illustrated in Figure 1. 
Among the 285 distinct breast cancer-related genes that were 
supported by at least one literature citation in these hand- 
curated databases, 26 were absent from MedGene, suggesting 
a false negative rate of approximately 9%. To determine why
these were missed, all literature references for these genes (80 
papers) were reviewed manually (see the Supporting Information,
Supplemental Table 2, or visit
http://hipseq.med.harvard.edu/MedGene/publication/s_Table 2.html). Among
these papers, most false negatives were caused by nonstandard
gene terms or gene terms eliminated by our specificity filters.
Few genes were missed because they were only mentioned in 
review papers (0.4%) or they appeared only in the body of the 
manuscript but not the abstract or title (1.1%). Of note, 
MedGene identified approximately 2000 additional breast 
cancer-related genes not listed in any other database. 
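
The false negative estimate amounts to a set comparison: pool the genes from the hand-curated databases, then count how many are absent from the mined list. The sketch below uses invented gene sets and hypothetical names; it omits the manual check that each curated gene is backed by at least one literature reference. With the reported counts, 26 missed genes out of 285 gives 26/285, or about 9%.

```python
def false_negative_rate(curated, mined):
    """Return (missed genes, miss rate) for a mined gene set.

    `curated` maps a database name to its set of genes; the reference set is
    their union, so a gene listed in several databases is counted once.
    """
    reference = set().union(*curated.values())
    missed = reference - mined
    return missed, len(missed) / len(reference)

# Invented gene sets, for illustration only.
curated = {
    "TGD": {"BRCA1", "TP53", "ERBB2"},
    "BCG": {"BRCA1", "BRCA2"},
    "GC": {"ESR1", "TP53"},
}
mined = {"BRCA1", "BRCA2", "TP53", "ESR1", "MKI67"}
missed, rate = false_negative_rate(curated, mined)

# The counts reported in the text: 26 missed out of 285 distinct genes.
reported_rate = 26 / 285
```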

To assess the false positive error rate, two complementary 
approaches were used: a detailed analysis of one disease and 
a global examination of 1000 diseases. The detailed approach 
examined the false positive error rate and its sources, whereas 
the global approach tested whether the overall results made 
biomedical sense. 

Using the LPF, 1467 genes related to prostate cancer were
assembled in rank order. We then retrieved approximately 300 
Medline records each for the highest ranked 100 and the lowest 
ranked 200 genes and manually reviewed the titles and 
abstracts to determine the verity of the association. Nearly 80% 
of the highest ranked 100 genes fell into one of the five 
categories that reflect meaningful gene-disease relationships 
(see the Supporting Information, Supplemental Table 3, or visit 
http://hipseq.med.harvard.edu/MedGene/publication/s_Table 3.html).
Among the lowest ranked 200 genes, approximately
70% reflected true relationships. Of the 600 records
reviewed, there were only two in which the association between 
the gene and the disease was described as negative. Both were 
genes with very low scores. In both cases, the authors did not 
argue the absence of any relationship, but rather that a 
particular feature of the gene or protein was not shown to be 
related to human prostate cancer. 13,14 

The coincidence of some gene symbols with medical ab- 
breviations, chemical abbreviations and biological abbrevia- 
tions resulted in most of the false positives (see the Supporting 
Information, Supplemental Table 4, or visit
http://hipseq.med.harvard.edu/MedGene/publication/s_Table 4.html),
emphasizing the importance of the filters that were added in the
search algorithm (Table 1). Without the filters, the false positive 
rate more than doubled, and the false negative rate rose 
dramatically (data not shown). For example, among the papers 
about breast cancer, there were only 12 Medline records that 
referred to ESR1 and 10 to ESR2, whereas almost 2000 papers
mentioned estrogen receptor without specifying ESR1 or ESR2;
this latter group was detected by the family stem term filter.

To further validate these results, a global analysis of the gene- 
disease relationships described by MedGene was performed. 
For this experiment, it was reasoned that the more closely 
related the diseases are to one another, the more they will be 
related to the same gene sets. Thus, if the relationships defined 
by MedGene accurately reflected the literature, then an unsu- 
pervised hierarchical clustering of the gene data should group 
diseases in a manner consistent with common medical think- 
ing. Conversely, if the clustered diseases do not make sense 
biologically or medically, it may reflect excessive false positives, 
false negatives, or inappropriate scoring of the data. 

To execute this experiment, the gene sets and the corre- 
sponding LPF values for 1000 randomly selected diseases (each 
with at least 50 gene relationships) were used as a dataset for 
clustering the diseases. A review of the results showed that the 
resulting disease clusters were indeed logical based upon 
common medical knowledge (see the Supporting Information, 
Supplemental Figure 1, or visit http://hipseq.med.harvard.edu/
MedGene/publication/s_Figure 1.html). For example, in one
such cluster shown in Figure 2, diabetes and its complications 
grouped together and were also closely linked to diseases 
associated with starvation states. 
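
To make the reasoning behind the clustering concrete, the sketch below normalizes each disease's scores by its total and computes a pairwise correlation, which is the kind of similarity a hierarchical clustering starts from. It is not the "Cluster" program: the similarity measure, the treatment of missing genes as zero, and the positive toy scores are all assumptions (real LPF values are logarithms and may be negative).

```python
import math

def normalize(scores):
    """Divide each gene score by the disease's total score."""
    total = sum(scores.values())
    return {gene: value / total for gene, value in scores.items()}

def pearson(u, v, genes):
    """Pearson correlation of two disease profiles over a shared gene list.

    Genes absent from a profile count as 0. Profiles with no variation
    are not handled in this sketch.
    """
    x = [u.get(g, 0.0) for g in genes]
    y = [v.get(g, 0.0) for g in genes]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Invented scores: diseases A and B share a gene set, disease C does not.
raw = {
    "disease A": {"INS": 4.0, "GCG": 3.0, "LEP": 1.0},
    "disease B": {"INS": 3.0, "GCG": 3.0, "LEP": 2.0},
    "disease C": {"TP53": 5.0, "BRCA1": 3.0},
}
profiles = {name: normalize(scores) for name, scores in raw.items()}
genes = sorted({g for scores in raw.values() for g in scores})
sim_ab = pearson(profiles["disease A"], profiles["disease B"], genes)
sim_ac = pearson(profiles["disease A"], profiles["disease C"], genes)
```

Diseases that share gene sets end up with a high similarity and would be merged early by an agglomerative clustering, whereas diseases with disjoint gene sets correlate negatively and join late.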

The number of genes associated with a given disease can 
be estimated by adjusting the MedGene number up by the false 
negative rate (~9%) and down by the false positive rate (~26%
on average). Using this, the average disease has 103.7 ± 45.3
(mean ± s.d.) genes associated with it, although the range is
quite broad with 2359 genes related to breast cancer, 2122 
genes related to lung cancer and no genes related to a number 
of diseases. 
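
The text does not give the formula used for this adjustment, so the following is one plausible reading, stated as an assumption: remove the expected false positives from the observed count, then scale up for the genes that were missed. The function name and the example count of 1000 are hypothetical.

```python
def adjusted_gene_count(observed, false_positive_rate=0.26, false_negative_rate=0.09):
    """Estimate the true number of associated genes from an observed count.

    Assumed model: observed * (1 - FP rate) of the hits are real, and those
    real hits are only (1 - FN rate) of all truly associated genes.
    """
    true_hits = observed * (1.0 - false_positive_rate)
    return true_hits / (1.0 - false_negative_rate)

# A hypothetical disease with 1000 genes reported by the search.
estimate = adjusted_gene_count(1000)
```

Under these assumptions the two corrections partly offset each other, and the net effect is a reduction of roughly one fifth.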

Applying MedGene to the Analysis of Large Datasets. Access 
to a comprehensive summary of the genes linked to human 
diseases provided an opportunity to analyze data obtained from 
a high-throughput experiment. We compared the MedGene 
breast cancer gene list to a gene expression data set generated 
from a micro-array analysis comparing breast cancer and 
normal breast tissue samples. Micro-array analysis identified
2286 genes that had greater than a 1-fold difference in mean 
expression level between breast cancer samples and normal 
breast samples. Using MedGene, we sorted the 2286 genes into 
four classes: 555 genes directly linked to breast cancer in the 
literature by gene term search (first-degree association by gene 
name): 328 genes directly linked by family term search (first- 
degree association by family term): 1021 genes linked to breast 
cancer only through other breast cancer genes (second-degree 
association): and 505 genes not previously associated with 
breast cancer. (See the Supporting Information, Supplemental 
Figure 2, or visit http://hipseq.med.harvard.edu/MedGene/ 
publication/s_Figure 2.html.) Among the 505 previously un- 
related genes, 467 were either newly identified genes or genes 
that had not previously been associated with any disease. 
Among the remaining 38 genes, 9 had been related to other 
cancers, specifically esophageal, colon, uterine, skin, and cervix. 
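
The four-way sort described above can be sketched as a simple precedence rule. This is a minimal illustration, assuming that each gene is assigned to the strongest class it qualifies for (gene term, then family term, then second-degree); the function name and the toy gene sets are hypothetical and are not MedGene data structures.

```python
def classify_genes(expressed, by_gene_term, by_family_term, second_degree):
    """Sort expressed genes into four literature-association classes.

    Assumed precedence: a gene found by gene term search is never
    re-counted under family term or second-degree association.
    """
    classes = {
        "first_degree_gene": [],
        "first_degree_family": [],
        "second_degree": [],
        "unassociated": [],
    }
    for gene in expressed:
        if gene in by_gene_term:
            classes["first_degree_gene"].append(gene)
        elif gene in by_family_term:
            classes["first_degree_family"].append(gene)
        elif gene in second_degree:
            classes["second_degree"].append(gene)
        else:
            classes["unassociated"].append(gene)
    return classes


if __name__ == "__main__":
    # Toy inputs for illustration only.
    result = classify_genes(
        expressed=["BRCA2", "ERBB3", "GENE_X", "GENE_Y"],
        by_gene_term={"BRCA2"},
        by_family_term={"ERBB3", "BRCA2"},
        second_degree={"GENE_X"},
    )
    for name, genes in result.items():
        print(name, genes)
```

Because every gene lands in exactly one class, the four class sizes add up to the number of genes that were sorted.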

To determine whether the genes highlighted by the micro- 
array analysis were more likely to have been previously linked 
to breast cancer in the literature, we created a two-dimensional 
plot of the fold change of expression level between breast 
cancer and normal tissue versus the literature score (LPF) 
(Figure 3A). There was a broad spread of expression changes 
among the genes directly linked to breast cancer, ranging from
less than 1-fold change (68%) to over 40-fold (0.3%). Notably, 
the majority of genes with greater than 10-fold expression 
changes were linked to breast cancer by first-degree associa- 
tion. 

Among all 754 genes directly linked to breast cancer in the 
literature, there was no correlation between LPF and micro-
array fold change (r = 0.018, p-value = 0.62). However, when
we stratified the analysis based on the magnitude of the fold 
change, we observed an increasing trend in correlation (Figure 
3B) suggesting that genes with a more substantial change in 
expression level were more likely to have a stronger association 
in the literature. For genes that had 10-fold change or more in 
expression level, the correlation increased to 0.41 (p-value =
0.05).
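
The stratified analysis above can be sketched as follows. The Figure 3 caption names Spearman rank correlation as the statistic; everything else here is an assumption made for illustration: the function names, the use of the absolute fold change for thresholding, and the toy data. The rank correlation is implemented directly so that the example has no external dependencies.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; inputs must not be constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def stratified_spearman(genes, thresholds, min_genes=3):
    """genes: list of (fold_change, lpf_score) pairs.

    For each threshold, correlate fold change with LPF using only genes
    whose absolute fold change is at least that threshold.
    """
    out = {}
    for t in thresholds:
        subset = [(abs(fc), lpf) for fc, lpf in genes if abs(fc) >= t]
        if len(subset) < min_genes:
            out[t] = None  # too few genes for a meaningful coefficient
            continue
        fcs, lpfs = zip(*subset)
        out[t] = spearman(list(fcs), list(lpfs))
    return out


if __name__ == "__main__":
    toy = [(1, 5), (2, 1), (12, 2), (20, 3), (40, 4)]  # made-up values
    print(stratified_spearman(toy, thresholds=[1, 10]))
```

In the toy data the correlation is absent across all genes but perfect among genes above the 10-fold threshold, which mirrors the qualitative pattern reported in the text, not its actual values.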

When we evaluated the micro-array data separately for ER 
positive and ER negative tumors, the trend in correlation 
between fold change and literature score was highly dependent 
on estrogen receptor status. Interestingly, there was a similar
trend in correlation for ER positive tumors, but no trend in 
correlation for ER negative tumors.

[Figure 2(B), cluster leaf labels in the order shown: Coxsackievirus Infections; Obesity in Diabetes; Diabetic Ketoacidosis; Glucose Intolerance; Diabetes Mellitus, Non-Insulin-Dependent; Diabetes Mellitus, Insulin-Dependent; Pregnancy in Diabetics; Diabetic Retinopathy; Diabetic Angiopathies; Diabetic Neuropathies; Glycosuria; Hyperinsulinism; Hyperinsulinemia; Hypoglycemia; Hyperglycemia; Diabetes Mellitus, Experimental; Diabetes Mellitus; Diabetes, Gestational; Starvation; Jaundice, Neonatal; Brain Edema; Pulmonary Edema; Nutrition Disorders; Kwashiorkor; Critical Illness; Burns; Diabetic Nephropathies; Albuminuria; Insulinoma]

Figure 2. Global validation by clustering analysis. 2(A). The gene sets and the corresponding LPF values for 1000 diseases, each with 
at least 50 gene relationships, were used in an unsupervised clustering of the diseases based on the gene patterns associated with 
them. A sample of the data is shown here. 2(B). One of the resulting clusters is shown that corresponds to blood sugar states. Diabetes 
terms (above the line) and starvation states terms (under the line) clustered together. Within these groups, there is also clustering of 
diabetic small vessel complications, altered serum chemistries, nutritional disorders, etc. (Supplemental Figure 1:
http://hipseq.med.harvard.edu/MedGene/publication/s_Figure 1.html).

Finally, to validate our findings, we computed similar correlations between the breast cancer expression data and
LPF scores generated by MedGene for hypertension, a disease unrelated to breast cancer. As expected, we did not
observe an increasing trend in correlation for hypertension.

Figure 3. Relationship between literature score and functional data for breast cancer. 3A. The data from an expression analysis of
samples for breast tumors and normal breast tissue were analyzed to indicate the fold difference of expression level between breast
tumor and normal sample (cutoff > 3-fold change). The fold changes were plotted against the literature score for the same gene set.
Green dots represent first-degree association by gene search, blue dots represent first-degree association by family search and red 
dots represent no-association. Some well-studied genes, such as BRCA2 (pink circle), are not reflected by a substantial difference in 
expression level. Furthermore, the majority of genes that have no association with breast cancer in the literature had less than 10-fold 
expression changes (shaded area). 3B. The Spearman rank-correlation coefficients between literature score (LPF) and the fold change 
of expression level between tumor and normal breast samples (y-axis) in relation to the amount of fold change of expression level 
(x-axis). Gene rank lists were generated for breast cancer (blue) and hypertension (pink). Correlations were also computed between 
the breast cancer gene LPF scores and fold change expression data among estrogen receptor positive tumors only (light blue) and 
estrogen receptor negative tumors only (purple).

breast neoplasms    hypertension   rheumatoid arthritis   bipolar disorder   atherosclerosis
estrogen receptor   REN            RA                     ERPA1              apolipoprotein
PGR                 DBP            TNFRSF10A              SNAP29             APOE
ERBB2               LEP            CRP                    PFKL               LDLR
BRCA1               ACT            AS                     DRD2               ELN
BRCA2               INS            ESR1                   TRH                ARG1
EGFR                kallikrein     HLA-DRB1               IMPA2              APOB
CYP19               ACE            DR1                    HTR3A              APOA1
TFF1                endothelin     interleukin            DRD3               MSR1
PSEN2               S100A6         TNF                    RUM                LPL
TP53                BDK            IL6                    KCNN3              PON1

Further entries, in reading order: plasminogen activator inhibitor, CES3, DIANPH, collagen, DRD4, CEACAM5, SARI, IL1A, HTR2C, PLC, vascular cell adhesion molecule, ERBB3, PIH, ACR, RELN, cyclin, CD59, TNFRSF12, DBH, ATOM, COX5A, ALB, IL2, MAOA, VWF, cathepsin, insulin-like, intercellular, ATP1B3, SIL

* MedGene results for the top 25 genes associated with breast neoplasms, hypertension, rheumatoid arthritis, bipolar disorder, and atherosclerosis, respectively,
ranked by LPF scores. The hyperlink to all the papers co-citing the gene and the disease is available at the MedGene website (http://hipseq.med.harvard.edu/
MedGene/).



Discussion 

The Human Genome Project heralded a new era in biological 
research where the emphasis on understanding specific path- 
ways has expanded to global studies of genomic organization 
and biological systems. High-throughput technologies can
provide novel insight into comprehensive biological function
but also introduce new challenges. The utility of these
technologies is limited by the ability to generate, analyze, and
interpret large gene lists. MedGene, a relational database
derived by mining the information in Medline, was created to
address this need. MedGene users can query for a rank-ordered
list of human gene-disease relationships (Table 2) for one or
more diseases. Each entry is hyperlinked to the original papers 
supporting each association and to other relevant databases. 

MedGene is an innovative extension of previous text mining
approaches. Perez-Iratxeta et al. used GO annotations and
chromosomal locations to predict genes that may con-
tribute to inherited disorders (ref. 8). MedGene takes a broader view
and includes all diseases and all possible gene-disease relation-
ships. Furthermore, MedGene utilizes co-citation to indicate a
relationship rather than GO annotation, which is limited to the
subset of genes that have GO annotation. Our approach is
complementary to that taken by Chaussabel and Sher, who
used the frequency of co-cited terms to cluster genes into a
hierarchy of gene-gene relationships (ref. 6).

A unique aspect of this tool is the ability to assess the relative 
strengths of gene-disease relationships based on the frequency 
of both co-citation and single citation. This presupposes that
most co-citations describe a positive association, often referred
to as publication bias (ref. 15), and is supported by our observations
that negative associations are rare (Supplemental Table 3:
http://hipseq.med.harvard.edu/MedGene/publication/s_Table 3.html).
Of course, relationships established by frequency
of co-citation do not necessarily represent a true biological link;
however, frequent co-citation is strong evidence to support a true relationship.

Another important feature of MedGene is the implementa- 
tion of software filters that substantially reduced the error rate. 
We estimate that less than 10% of all associations were missed 
and at least 70% of even the weakest associations were real. 
For this study, all of the filters that we applied were general 
ones, e.g., expanding the list of all gene names to address the 
different syntax forms used by different journals, eliminating 
gene names that correspond to common English words, etc. 
The majority of the remaining search term ambiguities were 
idiosyncratic and difficult to identify systematically without 
causing a significant rise in false negatives. Alternative ap- 
proaches, such as the examination of the nearest neighbor 
terms, need to be considered to further reduce the false positive 
rate. 
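
The two general filters mentioned above (expanding gene names into alternative syntax forms, and dropping names that are common English words) can be sketched as below. This is an illustrative guess at how such filters might look, not the MedGene implementation: the stop-word list, the variant rules, and both function names are hypothetical.

```python
import re

# Hypothetical stop list: gene symbols that collide with everyday words.
COMMON_WORDS = {"was", "for", "rest", "large", "impact"}


def name_variants(symbol):
    """Generate spelling variants such as 'BRCA1', 'BRCA-1', 'BRCA 1'."""
    # Split the symbol into alternating runs of letters and digits.
    parts = re.findall(r"[A-Za-z]+|\d+", symbol)
    variants = {symbol}
    if len(parts) > 1:
        variants.add("-".join(parts))
        variants.add(" ".join(parts))
    return variants


def usable_search_terms(symbols):
    """Drop ambiguous symbols and expand the rest into search variants."""
    kept = {}
    for symbol in symbols:
        if symbol.lower() in COMMON_WORDS:
            continue  # too ambiguous to search as free text
        kept[symbol] = sorted(name_variants(symbol))
    return kept


if __name__ == "__main__":
    print(usable_search_terms(["BRCA1", "WAS", "TNF"]))
```

As the text notes, filters of this general kind cannot catch idiosyncratic ambiguities, and widening the stop list would trade false positives for false negatives.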

It is not uncommon to see expression changes in micro- 
array experiments as small as 2-fold reported in the literature. 
Even when these expression changes are statistically significant, 
it is not always clear if they are biologically meaningful. When 
comparing expression levels of disease to normal tissue, one 
expects an enrichment of known disease-related genes to 
appear in the altered expression group. MedGene provided a 
unique opportunity to test this notion in the context of existing
knowledge on a novel breast cancer micro-array dataset. For
genes displaying a 5-fold change or less in tumors compared
to normal, there was no evidence of a correlation between
altered gene expression and a known role in the disease. This
reflects the many genes whose role in breast cancer may not
involve large changes in expression in sporadic tumors (e.g.,
BRCA1 and BRCA2) and genes whose modest changes in
expression may be unrelated to the disease. Strikingly, among
genes with a 10-fold change or more in expression level, there
was a strong and significant correlation between expression
level and a published role in the disease, providing the first
global validation of the micro-array approach to identifying
disease-specific genes.

Table 3. Genes with Large Expression Changes in ER- but
Not in ER+ Breast Tumors

POPS                 -16.2
BPAG1       -4.6     -22.3
PDZK1       -1.1     -36.8
            -2.8     -51.5
MUC6                 -64.9
SERPINA5    -1.0     -83.1
MBISl       -1.6     -85.9
CA12         2.4

Table 3. MedGene identified a set of relatively understudied, yet highly
expressed genes in ER negative, but not ER positive breast tumors. All of
these genes have either never been co-cited with breast cancer or have a
weak association except those marked with an *.

The results derived from MedGene have two implications. 
First, a careful hunt for corroborating evidence of a role in 
breast cancer should precede any further study of genes with 
less than 5-fold expression level changes. Second, any genes 
with 10-fold changes or more are likely to be related to breast 
cancer and warrant attention. It is likely that this threshold will 
change depending on the disease as well as the experiment. 

Interestingly, the observed correlation was only found among 
ER-positive tumors, not ER-negative. This may reflect a bias
in the literature to study the more prevalent type of tumor in 
the population. Furthermore, this emphasizes that caution 
must be taken when interpreting experiments that may contain 
subpopulations that behave very differently. The MedGene 
approach identified a set of relatively understudied, yet highly 
expressed genes in ER-negative tumors that are worthy of 
further examination (Table 3). 



In conclusion, we have developed an automated method of 
summarizing and organizing the vast biomedical literature. To 
our knowledge, the resulting database is the most comprehen- 
sive and accurate of its kind. By generating a score that reflects 
the strength of the association, it provides an important tool 
for the rapid and flexible analysis of large datasets from various 
high-throughput screening experiments. Furthermore, it can 
be used for selecting subsets of genes for functional studies, 
for building disease-specific arrays, for looking at genes com- 
mon to multiple diseases and various other high-throughput 
applications. In the future, it will be possible to enhance the 
utility of the MedGene database by building links between 
genes and other MeSH terms as well as other biological 
processes and concepts, such as cell division and responses to 
small molecules.

The number of genes in the human
genome is estimated at 50 000-
100 000. However, only a fraction of
these genes are expressed in any
cell. Moreover, the level of
gene expression in cells may vary 
with time, physiological conditions 
and disease states. This differential 
gene expression is generally 
reflected by the different number of 
mRNA species expressed in a given 
cell (~15 000 individual mRNA
species per cell) at any time point, 
and changes in relative mRNA levels 
may have important implications in 
the development of pathological 
processes. Therefore, discovery of 
differentially expressed genes is 
essential for the understanding of 
the molecular mechanisms involved 
in normal and pathological states, as 
well as providing new insights for 
discovery of new molecular targets 
for pharmacological manipulation 
and drug development. Hence, a 
number of techniques have been 
developed to identify genes (with 
known or unknown sequences and 
functions) that are differentially 
expressed in disease states. For 
example, northern hybridization, 
RNase protection assay, quantitative 
reverse transcription and polym- 
erase chain reaction (RT-PCR) have 
been successfully utilized to identify 
discordantly expressed known genes.
Other techniques, such as differen- 
tial hybridization and subtractive 
library screening, have been used 
successfully for the discovery of dif- 
ferentially expressed genes with 
known and/or unknown sequences. 
In the differential hybridization 
method, a cDNA library is first pre- 
pared and then screened using 
probes that are made from two 
different sources, for example nor- 
mal and diseased tissues. Subtractive 
library screening is carried out on
the basis of the construction of a
subtracted cDNA library from dif-
ferent RNA sources, for example
normal and diseased cells, from which
the identical mRNA species have
been removed using hybridization 
methods. Although these two tech- 
niques have proved to be useful 
in the discovery of differentially 
expressed genes, they are technically 
difficult and labour intensive, and 
require large amounts of mRNA (see 
Box 1). 

Recently, a number of PCR-based 
methods to uncover differentially 
expressed genes have been devel-
oped; these techniques include (1)
mRNA differential display (Ref. 1), (2) RNA
fingerprinting (Ref. 2) and (3) arbitrarily
primed PCR (Ref. 3). These PCR-
based techniques provide some
advantages over the conventional 
methods and have been used suc- 
cessfully for novel gene discovery. In 
particular, the mRNA differential 
display methodology has been 
adopted by a large number of lab- 
oratories as an important additional 
tool that has applications for both in 
vitro and in vivo test systems (Refs 1-7). An
overall strategic approach using this 
method for drug discovery is out- 
lined in Fig. 1. 

Messenger RNA expression 

Messenger RNA is the product 
of gene expression that encodes for a 
specific protein. The levels of mRNA 
in the cell are generally reflected by 
transcriptional regulation. Follow- 
ing transcription, mRNA is 
'matured' by capping the 5'-end, 
adding the polyadenylation [poly(A)] 
at the 3'-end, and splicing the intron 
sequences in eukaryotic cells. Taking 
advantage of the polyadenylated tail 
present in most eukaryotic mRNA 
species, the mRNAs can be reverse-
transcribed in the presence of
anchored primers complementary to
the 3'-end of mRNAs, such as
oligo(dTn) (where n = 12-18)
primers. In the technique of mRNA
differential display, a set of 3'-
anchored primers, such as T12MN
(where M = G, A or C and N = G, A, T
or C), is used to prime the reverse
transcription reactions.
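
The T12MN primer set defined above can be enumerated directly. The sketch below simply applies the stated rule (twelve T residues, then M from G, A or C, then N from G, A, T or C); the function names are hypothetical. Excluding T at the M position is what anchors the primer at the start of the poly(A) tail.

```python
from itertools import product


def anchored_primers(t_length=12):
    """Return every T(12)MN primer as a 5'->3' string."""
    m_bases = "GAC"   # M: any base except T
    n_bases = "GATC"  # N: any base
    return ["T" * t_length + m + n for m, n in product(m_bases, n_bases)]


def subgroup_by_last_base(primers):
    """Group primers by their 3'-terminal base (the N position)."""
    groups = {}
    for primer in primers:
        groups.setdefault(primer[-1], []).append(primer)
    return groups


if __name__ == "__main__":
    primers = anchored_primers()
    print(len(primers), "primers")
    for base, members in subgroup_by_last_base(primers).items():
        print(base, members)
```

The rule yields 3 x 4 = 12 distinct primers, which fall into four groups of three when sorted by the final base.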

Methodology: mRNA 
differential display 

The method of mRNA differen- 
tial display consists of two basic 
steps: (1) reverse transcription (RT) 
using a set of 3'-anchored primers, 
and (2) PCR amplification of 
cDNA fragments using arbitrary 
(upstream) primers and anchored 
(downstream) primers (Fig. 2). 

For the RT reaction, total 
cellular RNA (DNase treated to 
eliminate the possibility of genomic 
DNA contamination) is reverse- 
transcribed to yield the first strand 
cDNA primed with T12MN oligo-
nucleotides. This RT reaction en-
ables all the mRNA species having
a poly(A) tail to be reverse- 
transcribed. Typically, this RT re- 
action is divided into four sub- 
groups, each using a different 
T12MN primer with G, A, T or C at
the last base of the 3'-end. Because a 
large number of mRNA species are 
present in a cell, the division of sub- 
groups for the RT allows a portion 
of the mRNA species to be dis- 
played, which will increase the 
resolution of cDNA species after
amplification (Ref. 1).

Amplification of all the cDNAs is 
carried out using an upstream arbi- 
trary primer and a downstream 
anchored primer (identical to the 
one used for the RT) in the presence 
of a radioactive nucleotide (Fig. 2).
The upstream primer has been opti-
mized to ten bases in length, con-
taining approximately 50% GC
content (Ref. 1). In addition, a relatively
low annealing temperature (42°C) is 
also recommended for the PCR to 
allow some base mismatches so that 
a larger number of the amplified 
mRNA species can be obtained. 
Using these conditions of amplifi-
cation, it has been estimated that at
least 30-40 upstream primers in
combination with the downstream
primers will be necessary to amplify
every mRNA species present in a
given cell (Ref. 8).

Box 1. Comparison of mRNA differential display with subtractive library
screening for novel gene discovery

mRNA Differential display

• Key technique is based on RT-PCR

• Very sensitive to detect altered gene expression

• Allows multiple comparison, and monitors both
upregulated and downregulated genes

• Relatively reliable to detect the differentially expressed
genes; confirmation by other techniques is required

• Rapid to identify a lead probe

Subtractive library screening

• Crucial step is subtractive library construction

• Relatively insensitive, especially for those low
abundance mRNAs

• Usually compares only unidirectional change

• Very reliable to detect the altered gene
expression

• Relatively slow and complicated
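
The primer design constraint described above (an arbitrary upstream primer ten bases long with roughly 50% GC content) can be illustrated with a short generator. This is a sketch under simple assumptions: exactly five of the ten bases are G or C, positions are shuffled at random, and no other design rules (such as avoiding self-complementarity) are applied. The function names and the fixed random seed are illustrative choices.

```python
import random


def gc_fraction(seq):
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def random_decamer(rng, gc_count=5, length=10):
    """Build an arbitrary primer with a fixed number of G/C bases."""
    bases = [rng.choice("GC") for _ in range(gc_count)]
    bases += [rng.choice("AT") for _ in range(length - gc_count)]
    rng.shuffle(bases)  # distribute G/C positions at random
    return "".join(bases)


# A panel of 30 arbitrary primers, the lower bound quoted in the text.
rng = random.Random(0)  # fixed seed so the panel is reproducible
primers = [random_decamer(rng) for _ in range(30)]

if __name__ == "__main__":
    for p in primers[:5]:
        print(p, gc_fraction(p))
```

Each primer in the panel would be paired with the anchored downstream primers in separate PCR reactions, which is why the number of reactions grows quickly with the size of the panel.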

The amplified cDNA fragments 
are resolved by electrophoresis 
and subjected to autoradiographic 
analysis. By taking advantage of 
mRNA differential display, mul- 
tiple samples can be amplified and 
compared in parallel. As such, dif- 
ferences in gene expression, either 
upregulated or downregulated, can 
be identified in specific experi- 
mental or pathological conditions 
or along temporal expression pat- 
terns. As shown in Figure 3, the 
differential display analysis was 
carried out using cellular RNAs 
isolated from lipopolysaccharide 
(LPS)-stimulated and -unstimu-
lated rat aortic vessels (Ref. 9).

Band recovery 

Following mRNA differential 
display, the bands of interest may 
be recovered by applying the 
following three steps: the DNA 
band is (1) excised from the dried 
sequencing gel, (2) isolated by 
extraction procedures, and (3) 
reamplified using the same sets of 
primers as in the original PCR 
(Ref. 1). The recovered DNA band 
can serve as a probe to confirm 
mRNA expression by means of 
northern blot analysis, and/or be 
subcloned into a vector for further 
analysis. 



Confirmation of the 
differentially expressed genes 

Confirmation of gene expression is 
one of the crucial steps following 
mRNA differential display, inasmuch
as a large number of false-
positive bands may be present on dif-
ferential display. A variety of meth-
ods to reduce false positives have
been utilized in different laboratories;
the most commonly used method is
northern blot analysis. Dot blot,
quantitative RT-PCR, RNase protec- 
tion assays and other methods have 
also been used. 

Using two methods, differential 
display and northern blot analyses, 
the significant upregulation of 
mRNA (LPS-7) in response to LPS 
stimulation in cultured aortic vessels 
has been confirmed (Figs 3 and 4; see 
Ref. 9).

Fig. 1. Overall strategy for discovery of a novel pharmacological target using mRNA differential display. The procedure
(steps 1-6) using mRNA differential display is described in detail in this article. Southern blot analysis in step 4 is optional
but feasible when using reamplified samples on a large scale. The procedure for this Southern analysis is similar to that
for differential hybridization. The isolation of the full length cDNA and its further characterization (steps 7-9) are crucial in
the identification of a novel pharmacological target (step 10). RT, reverse transcription; ds, double stranded.

Fig. 2. Outline of the mRNA differential display technique. a: reverse transcription (RT) is carried
out using primers T12MN (M = G, A or C; N = G, A, T or C) in the presence of dNTP (deoxyribonucleo-
side triphosphate) and reverse transcriptase. The T12 sequence of the primers can anneal to the
poly(A) tail of any mRNA species, whereas the last two bases at the 3'-end of the primers (MN)
determine the specificity of the RT. b: PCR amplification is carried out using the same 3'-anchored
primers as in the RT reaction in combination with a particular upstream decamer of random
sequence. This reaction is usually labelled with radioactive dNTPs. The decamer will anneal to the
cDNAs with the complementary sequences and the region between this sequence and the
3'-anchored position will be amplified.



Identification of the 
differentially expressed genes 

It is fundamental to identify the 
genes discovered by mRNA differ- 
ential display. This step relies on 
the DNA sequencing analysis of the 
recovered DNA band. Because the 
primers used for differential dis- 
play are short and cannot be used 
successfully for direct sequencing 
by standard protocols, the differen- 
tial displayed DNA fragments are 
typically subcloned into a vector
prior to sequencing analysis (Refs 1-5).
Recently, direct sequencing of dif-
ferential display PCR products
became feasible (1) on the basis of
the use of elongated primers for
direct differential display (Refs 10, 11) or (2)
during the reamplification follow-
ing the original differential display
method (Ref. 9).



Using this sequence information, 
the identity of the differentially 
expressed genes can be determined 
by searching a database, such as 
GenBank. If the sequence represents 
an unknown sequence, a cDNA 
library can be screened using this 
DNA as a probe in order to obtain 
the full length cDNA clone. 

Advantages of mRNA 
differential display 

Compared with the conventional 
methods for the discovery of genes 
with altered expression in disease 
states, such as differential hybridiz- 
ation and subtractive library screen- 
ing, the mRNA differential display 
technique has several advantages 
(see Box 1): (1) simplicity in all key 
techniques (primarily RT-PCR); (2) 
sensitivity due to PCR amplification;
(3) versatility in detecting genes
that are either upregulated or
downregulated under various con-
ditions, and the ability to perform a
side-by-side comparison of differ-
ent samples; (4) rapidity in identify-
ing a probe (a cDNA) and confirm-
ing the results (e.g. northern blot);
(5) small amounts of RNA required;
and (6) reproducibility (the dis-
played bands in general show at
least 60-70% identity for different
repeats). These characteristics ren-
der this technique increasingly
popular for the discovery of novel
genes.

Fig. 3. Differential display of mRNA isolated
from cultured rat aortic artery stimulated
with LPS (lipopolysaccharide). Specifically,
an upstream primer (5'-GACCGCTTGT-3') in
combination with downstream primers,
either T12MG (left) or T12MA (right), was
used for the amplification. PCR products
were resolved in an 8 M urea/6% poly-
acrylamide DNA sequencing gel in the fol-
lowing order: lane 1, unstimulated and lane
2, stimulated aorta from spontaneously
hypertensive rats; lane 3, unstimulated and
lane 4, stimulated aorta from Wistar-Kyoto
rats. The band(s) indicated with an arrow-
head (designated as LPS-7) shows a marked
induction in response to LPS stimulation on
the differential display gel.

Limitations of mRNA
differential display

While the differential display 
technique has significant advan- 
tages, some disadvantages in using 
mRNA differential display must be 
acknowledged (Ref. 12). The major concerns
are the high incidence of false posi- 
tives, and the labour-intensive 
nature of this procedure for large- 
scale screening. In addition, the 
cDNA fragments isolated by this 
method are typically small, and fre- 
quently located in the 3'-untrans- 
lated region. Therefore, in order to 
identify the differentially expressed 
gene, one may need to screen a 
cDNA library to isolate the full 
length cDNA clone. Moreover, in
order to observe every differentially 
expressed gene in the mRNA popu- 
lation, at least 20-25 (and possibly 
up to 80, see Ref. 13) upstream 
primers in combination with down- 
stream anchored primers should be 
used (based upon theoretical calcu- 
lations)**. It is obvious that this tech- 
nique needs to be refined further in 
order to be efficiently and widely 
applied for large-scale searching of 
altered gene expression in different 
diseases or under different experi- 
mental conditions. 

Recently, significant improve-
ments and modifications have been
made to the method as originally
described (Ref. 1) in order to overcome
some of the existing problems in this
technique (Ref. 14), e.g. (1) emphasis has
been placed on the importance of
DNA-free RNA samples and multiple
displays of samples; this will reduce
the frequency of false positives (Ref. 15);
(2) longer primers are used, e.g.
15-20 mers, as in RNA-fingerprint-
ing (Ref. 2); this not only increases the
reproducibility of differential dis-
play, but also allows direct sequenc-
ing after PCR amplification (Refs 10, 11); (3)
slot blot has been
used to evaluate the bands identified
after differential display (Ref. 16), or
northern blot used for affinity captur-
ing of cDNAs (Ref. 17); these
methods reduce the labour-intensive
nature of this work for large scale
screening. Furthermore, (4) the
potentially hazardous nature of 35S as a
radiolabel for differential display
has been noted, and either 32P or 33P
has been recommended as an alterna-
tive label (Refs 18, 19).

Concluding remarks 

Differential display of mRNA is 
one of the most flexible and com- 
prehensive methods available for 
the detection of differentially ex- 
pressed genes in the cell. Since its 
initial description, this technique 
has been established in many lab- 
oratories and applied successfully 
in the identification of genes using 
in vitro and in vivo systems. In addi- 
tion, other strategies aimed at dis- 
covering novel genes are emerging, 
such as serial
analysis of gene expression
(SAGE; Ref. 20) and representational dif-
ference analysis (RDA; Ref. 21). The appli-
cation of mRNA differential dis-
play, and other techniques, for the
isolation of novel genes associated 
with disease processes will no 
doubt facilitate the discovery of 
novel therapeutic targets and/or 
will help to understand the mol- 
ecular mechanisms of disease. 
However, this is the first of many 
steps (Fig. 1) required in the discov- 
ery of a novel pharmacological 
target, especially given that the 
function of this factor is most likely 
unknown. Therefore, further action 
should be taken to characterize the 
functions of a particular gene of 
interest, including isolation of full 
length cDNA, expression of the 
gene product for functional study 
and target validation for the im- 
portance of this gene in disease 
processes.

Gene Families: The Taxonomy of Protein
Paralogs and Chimeras

Steven Henikoff,* Elizabeth A. Greene, Shmuel Pietrokovski, Peer Bork, 
Teresa K. Attwood, Leroy Hood 



Ancient duplications and rearrangements of protein-coding segments have resulted in
complex gene family relationships. Duplications can be tandem or dispersed and can 
involve entire coding regions or modules that correspond to folded protein domains. As 
a result, gene products may acquire new specificities, altered recognition properties, or 
modified functions. Extreme proliferation of some families within an organism, perhaps
at the expense of other families, may correspond to functional innovations during evo- 
lution. The underlying processes are still at work, and the large fraction of human and 
other genomes consisting of transposable elements may be a manifestation of the 
evolutionary benefits of genomic flexibility. 



it, the unknown backbone structure can 
be predicted with confidence. In the case 
of homeoboxes, the high level of inferred 
structural similarity has guided site-direct- 
ed modification of this DNA-binding do- 
main for homeoboxes other than the 
structural archetype, and this situation 
holds for ~30% of known protein se-
quences (7).



Linnaeus introduced a universal classifica- 
tion system of living things that was able to 
organize the enormous complexity of bio- 
logical relationships. A universal gene clas- 
sification system presents a similar chal- 
lenge but with added complexity. If a single 
gene is likened to an individual, then the 
collection of genes sharing common ances- 
try, typically performing the same role in 
different organisms, would be analogous to a 
species. Genes that are related in this way 
are commonly referred to as "orthologs" ( / ). 
Higher levels of gene or protein classifica- 
tion, such as families, subfamilies, and su- 
perfamilies, create a hierarchy in molecular 
taxonomy (2). Just what constitutes gene 
classification criteria can be uncertain in 
practice. This situation is made much more 
uncertain by the existence of nonortholo- 
gous relationships. Multiple proteins result- 
ing from gene duplications within an organ- 
ism are termed "paralogs," Paralogous rela- 
tionships have been known for several de- 
cades: a-globin, p-globin, and myoglobin 
are classical examples of paralogs that arose 
from duplications of ancestral globin genes 
in the vertebrate lineage (3). In recent 
years, with the explosive increase in avail- 
able sequence data, we have become aware 
of the richness of paralogous relationships 
in all organisms. We now realize that pro- 
tein building blocks, or "modules, 1 ' have 
duplicated and evolved in complex ways 
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through a variety of gene-rearrangement 
mechanisms (4). As a result, composite pro- 
teins consisting of multiple modules ("chi- 
meras") constitute a large proportion of the 
protein complement of an organism. The 
complexity that results from so many 
paralogous and chimeric relationships pre- 
sents a daunting challenge for classification. 
Meeting the challenge unites sequence with 
biological information. 

Like taxa, which reflect common ances- 
try but can also be used to infer common 
function, gene families have been of tremen- 
dous importance for understanding gene and 
protein function. Nearly all biological disci- 
plines have profited from discoveries of fam- 
ily relationships. Such discoveries have re- 
emphasized die importance of model systems 
in biology. For example, die sequencing of 
Drosopkk Vltrabithorax and Antennapedia se- 
lector genes controlling segment identity de- 
lineated a shared homeobox module; this led 
to the discovery and intense study of related 
HOX genes in vertebrates and other organ- 
isms that are thought to play key roles in 
determining developmental fetes (5). This 
example illustrates an increasingly popular 
paradigm in molecular genetics: Rather than 
proceeding from a phenotype to the isolation 
of a new gene, an investigator begins with 
the sequence of a key gene and searches for 
homologous genes in an organism of interest, 
preferably by scrutinizing the sequence data- 
banks (6). Experimental data accumulated 
for the homologous (orthologous or paralo- 
gous) gene, when integrated with insights 
from gene family relationships, can acceler- 
ate our understanding of biological processes 
and our ability to rationally engineer genes. 

Not just functional, but also structural 
inferences made from protein sequence 
alignments have been valuable to biolo- 
gists. When a structure is known for one 
sequence, and another can be aligned with 



Motifs, Modules, and Chimeras 

The smallest sequence units of protein fam- 
ilies are termed "motifs," which are identi- 
fied as highly similar regions in alignments 
of protein segments (8). Motifs can be as 
simple as the hexamer repeat unit that 
forms a left-handed parallel β-helix found
in uridine 5'-diphosphate (UDP)-N-acet-
ylglucosamine acyltransferase (9). Motifs
are widely used to identify functional re- 
gions of proteins and, where they share 
common ancestry, are useful for family clas-
sification. The C2H2 zinc finger DNA-
binding motif, which is illustrated in the
accompanying chart, defines the largest 
known family. By virtue of forming a con- 
tiguous independently folded structure, the 
finger is itself a module, whose small size of 
21 to 26 amino acids is attributable to a zinc 
cation, which holds together two cysteine 
and two histidine residues from either end 
of the module. The larger homeobox mod-
ule consists of a ~60-amino acid motif also
involved in binding DNA. More typically, 
modules consist of multiple motifs, which 
form the structural core of proteins. Motifs 
contributing to a structural core can be 
widely separated within the primary se- 
quence, as illustrated by the "HIGH" and 
"KMSKS" motifs of the Class I aminoacyl 
tRNA synthetases, which are hundreds of 
amino acids apart (10). Enzyme active site
residues, which are usually highly con- 
served, are often found within motifs. 

Motifs may reflect either common an- 
cestry or convergence from independent or- 
igins. In either case, identification of motifs 
can be important for drawing structural and 
functional inferences. For example, the 
common "P-loop" motif is present in nucle-
otide-binding domains from families as di-
verse as kinesin motor proteins and adeno-
sine 5'-triphosphate (ATP)-binding cas-
sette (ABC) transporters, which are depict-
ed in the accompanying chart. Despite the
lack of a known structure for any ATP-
binding cassette, the presence of a P-loop
predicts the site of ATP binding in the 
transporter complex. 
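
As an illustration of how a motif such as the P-loop can be located in a sequence, the following is a minimal sketch rather than the method used by the authors. It assumes the Walker A consensus as written in the PROSITE pattern PS00017, [AG]-x(4)-G-K-[ST], and scans a protein sequence with a regular expression. The function name `find_p_loops` and the toy sequence are hypothetical and chosen only for this example.

```python
import re

# Walker A ("P-loop") consensus, PROSITE-style: [AG]-x(4)-G-K-[ST]
# (assumption: this simple pattern stands in for more sensitive profile methods)
P_LOOP = re.compile(r"[AG].{4}GK[ST]")


def find_p_loops(protein_seq):
    """Return (0-based start, matched segment) for each non-overlapping match."""
    seq = protein_seq.upper()
    return [(m.start(), m.group()) for m in P_LOOP.finditer(seq)]


# Hypothetical toy sequence, not taken from any real protein
toy_sequence = "MSLLKPGENVAIVGPSGSGKSTLLRLIAG"

for start, segment in find_p_loops(toy_sequence):
    print(f"P-loop-like segment {segment} at position {start}")
```

A match of this kind only suggests a candidate nucleotide-binding site; short patterns also match by chance, so a hit would normally be weighed against other family evidence.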

Modules are composed of single or mul- 
tiple motifs. As the fundamental units of 
protein structure and function, modules are 
most useful for protein classification. Mod-
ules frequently display different connectiv-
ity relationships (Fig. 1, A to F), as illus-
trated by the kinesins and ABC transport-
ers. The kinesin motor domain can be at
either end of a polypeptide chain that in- 
cludes a coiled-coil region and a cargo do- 
main (11). ABC transporters are four-do- 
main proteins consisting of two unrelated 
modules, a pair of ATP-binding cassettes, 
and a pair of integral membrane modules, 
which can be connected in different ways 
(12) (Fig. 1C). 



Dispersal of Protein 
Building Blocks 

Family relationships evolve over long peri- 
ods of time by speciation and by sequence 
duplications fixed in genomes. Even the 
most recently evolved family relationships 
are still so ancient that the events that gave 
rise to paralogs and chimeras in modern 
genomes cannot be directly observed. How-
ever, enough is known about genomic-rear-
rangement mechanisms that some inferenc-
es can be drawn. Chromosomes evolve by
transposition of mobile elements; by gross 
rearrangements such as inversions, translo- 
cations, deletions, and duplications; by ho- 
mologous recombination; and by slippage of 
DNA polymerases during replication. It is 
likely that all of these mechanisms have 
contributed to the proliferation and dispers-
al of protein building blocks. Modules
present in larger proteins, including ho-
meobox modules, might have dispersed by
transposition. Tandemly repeated modules,
including the C2H2 zinc fingers and many
examples of extracellular modules, most 
likely arose by recombinational mecha- 
nisms, such as unequal crossing-over and 
gene conversion (Fig. 1, A and E). 

Multiple eukaryotic biosynthetic en- 
zymes, especially those in the purine and 
pyrimidine pathways, are sometimes found 
together within a single polypeptide, unlike 
their separately encoded bacterial orthologs 
(13). For example, vertebrates have a mul- 
tienzyme polypeptide for GAR synthetase, 
AIR synthetase, and GAR transformylase 
(GARS-AIRS-GART) (14). In insects, the 
polypeptide appears as GARS-(AIRS)2-
GART; in yeast, GARS-AIRS is encoded
separately from GART; and in bacteria,
GARS, AIRS, and GART are all encoded
separately (Fig. 1D). The sites of fusion may
correspond to introns, suggesting that chro-
mosomal rearrangements have fused tran-
scription units within introns. In other cas-
es, fusions might have occurred in exons, or
intron loss might have erased evidence of
intron-mediated fusion (15). Regardless of
mechanism, the fusion of transcription
units is likely to have contributed to com-
bining of protein building blocks in both
eukaryotes and prokaryotes.

Fig. 1. Schematic representations of various building block arrangements
described in the text. (A) Simple building blocks in DNA-binding proteins.
The human ZFY protein contains 13 tandemly repeated zinc finger mod-
ules, and the Drosophila paired protein contains a paired box and a
homeobox. (B) Subfamily relationships as predictors of quaternary struc-
ture: dimeric kinesin heavy chain (KHC) and tetrameric BimC protein com-
plexes. (C) ABC transporters display different connectivities of two subunit
pairs. Other examples of circular permutation have been recently reviewed
(54). (D) Organism-specific fusion and duplication of purine biosynthetic
pathway orthologs to GARS, AIRS, and GART. (E) Diverse modules are
found in the extracellular portion of protein tyrosine kinases. (F) Humans
are polymorphic for duplications and deletions within the opsin tandem
cluster of long-wavelength genes. (G) T cell receptor (TCR) genes are
interrupted by clusters of 0 -trypsinogen genes. (H) Alternative processing
produces membrane-bound, secreted or intracellular forms of antibodies
(or both), and acetylcholinesterases.

The mechanisms that gave rise to the
dispersal of paralogous proteins within ge-
nomes are also diverse and frequently un-
certain. The rhodopsin-like guanosine 5'-
triphosphate (GTP)-binding protein (G
protein)-coupled receptors illustrate mul-
tiple dispersal patterns (16). This family
includes hormone, neurotransmitter, 
light, and olfactory receptors that are dis- 
tinguished from one another by both se- 
quence and functional differences. Re- 
markably, there are several hundred hu- 
man olfactory receptor (OR) genes present 
in a dozen or so tandem clusters on several 
chromosomes (17). A cluster of three OR 
genes and an OR pseudogene fused to a 
different OR gene is thought to have aris- 
en from disparate events, including recom- 
binations between repeats flanking OR 
genes and a fusion by nonhomologous de- 
letion (18). 

Tandem gene clusters are sometimes in- 
terrupted by paralogous members of other 
gene families. For example, intercalated be-
tween repeated coding elements of the hu-
man β T cell receptor (TCR) locus are five
trypsinogen genes in inverted orientation
(19) (Fig. 1G). This complex arrangement
of genes is likely to be of functional signif- 
icance, as it is also found in mice and 
chickens. 

Many paralogous relationships might 
be the consequence of whole-genome 
duplications. Ancient tetraploidization 
events in eukaryotes have been obscured 
by subsequent divergence, interchromo- 
somal duplications, and other rearrange- 
ments but can be detected by careful anal- 
ysis of genomic sequence. For example, it 
has been proposed that the Saccharomyces 
genome underwent a whole-genome dupli- 
cation, and that 13% of Saccharomyces 
cerevisiae genes trace their lineage to this 
event (20). Tetraploidization events are 
common among higher plants; for exam- 
ple, the wheat genome consists of three 
copies of an ancestral grass genome. The 
human genome is thought to be the prod-
uct of multiple tetraploidization events
that occurred during chordate evolution 
(5). As a result, we have four copies of 
many genes or gene families, including 



four HOX gene clusters comparable to a 
single set of HOX genes in invertebrates. 
Enough time has passed since these puta- 
tive tetraploidization events that verte- 
brate HOX genes have acquired distin- 
guishable functions. 

Selection for Diversity 

The acquisition of a new specificity or a 
modified function after a gene-duplication 
event is often detectable by protein se-
quence comparison. For example, α-globins
are more closely related to one another
than they are to any β-globin. Maintenance
of an acquired function over long evolu- 
tionary intervals can contribute greatly to 
the understanding of gene specificity. For 
example, sequence differences are sufficient 
to distinguish among tRNA synthetases 
that charge different amino acids, even 
though they belong to the same ancestral 
family (21). The kinesin motor domains 
provide another example, where relation-
ships within a family are predictors for qua-
ternary structural features: BimC motor do-
mains are found in bipolar complexes, rath-
er than in asymmetric complexes character-
istic of other kinesin motors (22) (Fig. 1B).
Comparisons should be interpreted with 
caution, especially when sequences from 
very distant organisms are compared; appar- 
ent subfamily relationships will not always 
reflect shared function. Furthermore, simi- 
lar functions can arise in separate subfami- 
lies. For example, among the ABC trans- 
porters, iron uptake is a function of mem- 
bers of two distinct subfamilies (23). 

Relatively recent duplication events are 
sometimes responsible for diversity in molec- 
ular recognition. Tandem duplication of im- 
munoglobulin (Ig) and TCR variable, join- 
ing, and diversity gene segments is the pro- 
totypical example, and special mechanisms 
of somatic DNA rearrangement and muta- 
tion further diversify antibody and TCR 
specificity. Among the rhodopsin-like G 
protein-coupled receptors, different olfacto-
ry receptors are thought to recognize differ-
ent odorants, and different opsins are stimu-
lated by different wavelengths of light. Long-
and short-wavelength opsin genes diverged
from one another early in vertebrate evolu- 
tion (24). The opsins of the human visual 
system are present in a cluster on the X 
chromosome, with the long-wavelength 
opsins, sensitive to red and green light, con- 
stituting a tandem repeat with 98% sequence 
identity (Fig. 1F). Remarkably, the number
of long-wavelength genes is polymorphic, a 
consequence of unequal crossing-over events 
that have occurred during human evolution.
People with "normal" vision have a single 
red gene and one to three green genes. Peo- 
ple who are red-green colorblind have lost a 



long-wavelength gene through a fusion of 
red and green tandem copies. 

The products of gene duplication can 
act combinatorially and so further increase 
diversity. A response to a single antigen 
generally stimulates the proliferation of 
different B cells, each expressing a single 
antibody; the combination of different 
light and heavy chains provides height- 
ened specificity to antigen. For olfaction, 
the stimulation of multiple olfactory re- 
ceptors by their different odorants allows 
complex mixtures to be recognized. Our 
ability to recognize a full spectrum of col- 
ors with only three types of opsins is an- 
other example of the integration of mul- 
tiple sensory inputs that have originated 
from duplicated building blocks. 

Duplication of building blocks within a 
protein also results in generation of diversi-
ty during evolution. Each C2H2 zinc finger
in a DNA-binding protein can recognize a
3-base pair motif, and in combination,
multiple zinc fingers can mediate the bind- 
ing to more complex DNA recognition sites 
(25). Combinatorial recognition by tandem 
zinc fingers has been exploited by research- 
ers for designing new DNA-binding pro- 
teins (26). Combinations of unrelated mod- 
ules have also broadened the spectrum of 
DNA-binding recognition, such as the pres-
ence of a paired box and a homeobox mod-
ule in proteins related to Drosophila paired
(27) (Fig. 1A). Extracellular proteins are
notable for containing combinations of 
multicopy tandem arrays of different mod- 
ules. The extracellular portion of the recep- 
tor tyrosine-specific class of protein kinases 
contains an astonishing variety of modules 
representing different families. For example, 
trk-like kinases have one kringle and four Ig
modules, whereas tek-related proteins have
three fibronectin III, three epidermal
growth factor (EGF), and two Ig modules in
their extracellular NH2-terminal portions
(28) (Fig. 1E). These extracellular modules
can acquire diverse functions in different 
proteins. For example, some EGF modules 
bind to specific receptors, whereas others 
mediate interactions through calcium bind-
ing; the latter sometimes form long, rodlike
structures composed of tandem module ar- 
rays (29). 

Unlike germ-line processes that recom- 
bine gene segments during evolution, alter- 
native messenger RNA (mRNA) processing 
can increase the diversity of proteins in the 
soma. For example, an alternative polyade- 
nylation site within an intron of the Ig 
heavy-chain gene allows a switch from the 
synthesis of a membrane-bound receptor to a 
secreted antibody (30) (Fig. 1H). Acetylcho- 
linesterase provides an example of alterna- 
tive 3' splice site selection accomplishing a 
comparable task; the choice of one terminal 
exon leads to the synthesis of a glycophos-
pholipid membrane anchor, the choice of
the other to a cytoplasmic form, and lack of 
splicing to a secreted form of the enzyme 
(31). 

Why Are Some Families 
So Large? 

The accompanying chart provides informa- 
tion on the distribution of selected building 
blocks in model organisms. For organisms
with completely determined genomic se- 
quences, we can ask why some families are 
more successful than others. In Escherichia 
coli, the ABC transporters are the most
common proteins encoded; this might re- 
flect a flexible diet, which requires the up- 
take of diverse nutrients (12). It is likely 
that the much smaller number of ABC 
transporters in Mycoplasma genitalium and 
Methanococcus jannaschii reflects more limit-
ed diets. In general, paralogs account for
half of all E. coli genes (32), which is high 
compared to the fractions found for smaller 
bacterial genomes, such as Haemophilus in-
fluenzae, where one-third of all genes are
paralogs (33, 34). Much of this difference is 
attributable to the more diverse nutritional 
and metabolic requirements of E. coli (34).

For organisms that have not yet been 
fully sequenced, it is necessary to extrapo- 
late from samples of available sequences. 
For example, on the basis of finding only 
eight homeobox genes in S. cerevisiae, ex- 
trapolation predicts about 20 each in flies 
and worms, which are estimated to have 
two to three times as many genes (see ac- 
companying chart). The fact that there are 
already about 60 genes reported in each of 
these two complex multicellular organisms 
demonstrates that homeobox genes have 
more successfully proliferated in animals 
than in a yeast. Although the number from 



Drosophila melanogaster is based on only
~10% of its genome, we predict that most
of its homeobox genes have already been
identified, and the final number will not be
much greater than the number in Caeno-
rhabditis elegans (which has nearly the same
sized genome, ~70% of which is already
sequenced). Such disproportionate repre- 
sentation of particular families is both a 
manifestation of their intense interest to 
researchers and of the ability to obtain 
these members by hybridization and ampli- 
fication methods. Not all modules are as 
amenable to this approach as are the ho- 
meoboxes, which are especially highly con- 
served; to an increasing extent, partial com- 
plementary DNA (cDNA) sequencing 
projects are being used to identify coding 
sequences for gene families of interest (35). 
Many other gene families, such as the glo-
bins and the immunoglobulins, are dispro-
portionately represented in collections of
human sequences because they are impor-
tant for human health (Table 1).
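
The homeobox extrapolation in this section is simple proportional scaling, and can be written out as a short worked example. This is only a sketch of the arithmetic, using the figures quoted in the text (8 yeast genes, two to three times as many total genes in flies and worms, about 60 reported genes); the function name `expected_family_size` is a hypothetical helper introduced here.

```python
def expected_family_size(reference_count, gene_number_ratio):
    """Scale a family size by the ratio of total gene numbers (naive proportionality)."""
    return reference_count * gene_number_ratio


YEAST_HOMEOBOX_GENES = 8       # homeobox genes found in S. cerevisiae (from the text)
GENE_NUMBER_RATIOS = (2, 3)    # flies and worms: two to three times as many genes
REPORTED_IN_ANIMALS = 60       # "about 60" reported in each of fly and worm

# Expected family size at the low and high ends of the gene-number ratio
low, high = (expected_family_size(YEAST_HOMEOBOX_GENES, r) for r in GENE_NUMBER_RATIOS)

# How far the reported count exceeds the proportional expectation
excess_low = REPORTED_IN_ANIMALS / high
excess_high = REPORTED_IN_ANIMALS / low

print(f"expected by proportional scaling: {low} to {high} genes")
print(f"reported / expected: {excess_low} to {excess_high}")
```

The expected range of 16 to 24 brackets the "about 20" quoted above, and the reported count exceeds it by a factor of 2.5 to 3.75, which is the sense in which the family has proliferated disproportionately in animals.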

Even for the whole-genomic sequences 
that are currently available, the final size of 
known families is uncertain. Distant ho- 
mologs may lie just beyond the horizon of 
current homology-detection methods. How- 
ever, the introduction of improved method- 
ology continues unabated, and this has led to 
the discovery of new family members and 
interfamily relationships. Moreover, the in- 
creasing size of a family can be exploited by 
multiple sequence-based methods to identify 
additional members (36). For example, 12 
years ago, the similarity between opsin genes 
from human and fly was barely at the level of 
detection (37), yet today, the opsins are 
recognized as a closely related cluster within
the rhodopsin-like G protein-coupled recep- 
tors (see accompanying chart). Most impor- 
tantly, the accumulation of experimental ev- 
idence concerning gene or protein function 



or protein structure will provide insights that
can be used to deduce possible family rela-
tionships that would not be compelling by
sequence comparison methods alone.

Table 1. The largest protein families. The sources for these numbers of modules are Pfam (PF) or Prints
(PR). GPCR, G protein-coupled receptor; LDL, low density lipoprotein.

Family                          Source    Modules in SwissProt   Found where?

C2H2 zinc fingers               PF00096   1826                   Eukaryotes, archaea
Immunoglobulin module           PF00047   1351                   Animals
Protein (Ser/Thr/Tyr) kinases   PF00069   928                    All kingdoms
EGF-like domain                 PF00008   854                    Animals
EF-hand (Ca binding)            PF00036   790                    Animals
Globins                         PF00042   699                    Eukaryotes, bacteria
GPCR-rhodopsin                  PF00001   597                    Animals
Fibronectin type III            PF00041   514                    Eukaryotes, bacteria
Chymotrypsins                   PR00722   464                    Eukaryotes, bacteria
Homeodomain                     PF00046   453                    Eukaryotes
ABC cassette                    PF00005   373                    All kingdoms
Sushi domain                    PF00084   343                    Animals
RNA-binding domain              PF00076   331                    Eukaryotes
Ankyrin repeat                  PF00023   330                    Eukaryotes
RuBisCo large subunit           PF00016   319                    Plants, bacteria
LDL receptor A                  PF00057   309                    Animals

Phylogenetic Distribution 
of Families 

Size of a family within an organism is only 
one measure of success. Another is presence 
of a family in diverse organisms. Some fam- 
ilies are successful at both, such as the ABC 
transporter family, which is not only one of 
the largest families overall (Table 1), but 
also appears to be present in all organisms. 
Most other families that are so widely dis- 
tributed show much less proliferation with- 
in organisms. These include metabolic en- 
zymes and components of the translational 
apparatus, which have only a few close 
paralogs (38). These families show a similar 
distribution to that of the GARS module in 
the table of the accompanying chart (39). 

The chymotrypsin family of serine pro- 
teases is notable in being both ancient and 
large (Table 1), but the extreme prolifera- 
tion appears to be confined to eukaryotes; 
only rarely are family members found in 
bacteria. This raises the possibility that oth- 
er families that appear to be confined to 
certain branches of the tree of life are ac- 
tually more ancient, but that they have 
simply become extinct in other lineages, or 
that a relationship has gone undetected. 
The latter is the case for eukaryotic tubulin 
and bacterial FtsZ, both of which use GTP 
for polymerization to form similar intracel-
lular fibers and are believed to be ances-
trally related (40). This relationship was
not detected by pairwise sequence compar-
isons, but rather by recognition of a tubulin
motif in FtsZ. Potentially homologous pro- 
teins have also been identified by structure 
determination, such as the detection of sim- 
ilar folds for kinesin and myosin motor pro- 
teins (41). 

Given the extreme uncertainty in trac- 
ing the birth of a family, we nevertheless 
recognize that some families have prolif- 
erated to a remarkable extent in certain 
phyla. GAL4 transcriptional regulators, 
one of the largest families in yeast, have 
been found only in fungi (see accompany- 
ing chart). The EGF module, present in 
about 1% of human proteins, has been 
described only in animals (Table 1). The 
Ig module, which is found in more than 
200 proteins in addition to all of the im- 
mune receptors (antibodies, TCRs, class I 
and II families of the major histocompat- 
ibility complex), is involved in diverse cell 
surface recognition phenomena in multi- 
cellular organisms (42). The Ig module has 
also successfully proliferated within pro- 
teins: A total of 244 copies of Ig and 
distantly related fibronectin III modules
account for most of the 30,000-residue 
muscle titin protein (43). The success of 
the ~100-amino acid Ig module is attrib- 
utable to its potential to undergo diversi- 
fication in the presence of a highly con- 
served structural framework, its protease 
resistance in the folded form, and its 
ability to readily form homo- and het-
erodimers through multiple interacting
surfaces, so that it is especially suitable for 
mediating cell-cell interactions. 

Proliferation of one family might have 
occurred at the expense of others. The dis-
tribution of protein kinases is suggestive, in
that the family consisting of serine-, threo- 
nine-, and tyrosine-specific enzymes is 
hugely successful only in eukaryotes, but is 
poorly represented in bacteria (see accom- 
panying chart). Conversely, the family of 
histidine-specific protein kinases is highly
successful in E. coli and other bacteria, but 
is relatively rare in eukaryotes. In such sit- 
uations, we must also consider the possibil- 
ity that these families are recent arrivals in 
some organisms, having been transferred 
horizontally between kingdoms. Horizontal 
transfers are difficult to document unless 
there are conspicuous anomalies evident 
from molecular phylogenetic analyses. Such 
anomalies have indicated numerous hori- 
zontal transfers of mariner transposases be- 
tween diverse animals (44), as well as trans- 
fer of the fibronectin III module from a 
eukaryote to a bacterium (45). 

The establishment, proliferation, or ex- 
tinction of a protein family in a lineage may 
coincide with a functional innovation dur- 
ing evolution. For example, actins, tubulins, 
and motors such as kinesins are found only 
where there is a cytoskeleton, as though the 
evolution of these proteins was coordinate 
with the appearance of the cytoskeleton in
eukaryotes. In bacteria, σ factors regulate
transcriptional initiation, in contrast to eu- 
karyotes and archaea, which use a different 
system (46). This difference suggests that 
either the σ factor system coincided with
the appearance of bacteria or that it was lost 
in the eukaryotic-archaea lineage. 

Interspersed Genomewide 
Repeats 

Analysis of whole-genomic sequences defin- 
itively demonstrates that coding regions of 
genes dominate the prokaryotic genome 
(38). In contrast, complex eukaryotic ge-
nomes are dominated by noncoding se-
quences. Families of repeats derived from
transposable elements constitute a major 
portion of these eukaryotic genomes, far ex- 
ceeding exons in the proportion of the ge- 
nome devoted to them (47, 48). Transposi- 
tion can occur by reverse transcription of an 



RNA intermediate or by excision and rein-
tegration of DNA itself (DNA transposi-
tion). These elements fall into four catego-
ries: short interspersed nuclear elements
(SINEs), long interspersed nuclear elements
(LINEs), long-terminal repeat (LTR) retro-
virus-like elements, and DNA transposons
(Fig. 2). In the human, there are ~1,100,000
Alu sequences (a SINE) and 590,000 Line1
sequences (a LINE). It is impressive that
Line1 occupies an order of magnitude more
of our genome than all of our gene-coding
sequences combined. Furthermore, with im-
proved techniques for identifying degraded
repeat sequences, perhaps 50% of our ge-
nome and an even higher fraction of the
mouse genome will be found to consist of
genomewide repeats. Much of the nonas-
signed genome sequences might be com-
posed of interspersed repeats degraded to the
point that they are no longer recognizable.

Table 2. Content of long contiguous stretches of DNA sequence in selected human and mouse gene
regions. Data are from the Leroy Hood laboratory.

Region                       Contig        GC    mRNA   Interspersed   Line1   Alu or B1/B2
                             length (bp)   (%)   (%)    repeats (%)    (%)     (SINEs) (%)

Human TCRα                   1,071,650     40    4.0    35             16      8
Mouse TCRα                   228,654       41    1.5    33             22      2.4
Human TCRβ                   684,973       42    4.6    30             14      5
Human TCR on chromosome 9    216,293       41    1.7    45             23      9
Mouse TCRβ                   700,960       40    3.8    43             32      2
Human MHC class III          299,287       52    16.8   30.5           6.7     17

Vertebrate chromosomes have large- 
scale mosaic structures, or isochores, often 
with distinct ratios of G+C nucleotides, 
repeat content, and gene density (49). The 
human contigs in Table 2 represent high 
(class II major histocompatibility locus)-, 
medium (TCR)-, and low (metabolic glu- 
tamate receptor 8)-gene density regions. 
Low-gene density loci are A+T- and 
Line1-rich, whereas high-gene density loci
are G+C- and Alu-rich (47, 49). The 
A+T-rich isochores, in general, contain 
longer genes. 

The repeats may have at least three 
important functional and evolutionary 
roles. First, some may evolve to become 



the regulatory regions of genes expressed 
in a tissue-specific manner (50). Second, 
repeats play an important role in refash- 
ioning the genomic architecture by facili- 
tating homologous recombination, trans- 
locations, and perhaps gene conversions. 
And third, repeats have been implicated 
in epigenetic phenomena, such as parental 
imprinting and position-effect variegation 
(51). Because the ages of repeats can be 
determined by species comparisons, they 
can serve as valuable time markers for 
unraveling the complexities of molecular 
archaeology in complex gene loci such as 
the TCR genes. 

Prospects 

There is good news and bad news for gene 
taxonomists. The good news is that the
number of identified protein families has 
been increasing only slowly with the rapid 
increase in new sequence data and is ex- 
pected to level off. The bad news is that 
family relationships are so complex that 
we cannot use any simple hierarchical 
scheme to make the data easily under- 
standable. Nevertheless, as more is learned 
from model organisms about individual 
modules, their presence in any protein of 
interest adds potential insight into its 
function and guides experiments, which is 
good news for biologists. Gene taxono- 
mists have learned by now to cope with 
complexity in family relationships, and 
currently several classification systems are 
used to construct the different databases 



listed in the accompanying chart. In fact,
the task of classification is made easier for
gene taxonomists than for Linnaean tax-
onomists because sequence similarity is a
precisely defined metric for establishing
relatedness. This metric makes possible
automated and computer-assisted classifi-
cations of genes. Much more difficult is
the task of enriching the databases of
genes and families with insights obtained
from experiments.

Fig. 2. Schematic represen-
tation of the types of trans-
posable elements that have
produced high-copy number
human interspersed repeats.
The shaded boxes denote in-
ternal promoter sites; names
inside the bracket indicate
that only autonomous ele-
ments code for these pro-
teins. LTR, long-terminal re-
peat; ITR, inverted-terminal
repeat; RT, reverse transcrip-
tase. [Adapted from (47) on
the basis of 7051 kb of human sequence]

To some extent, computer-based tools 
can be applied to the task of connecting 
genes and families with information about 
them. Organism-specific databases and re- 
trieval tools such as the National Center for 
Biotechnology Information's Entrez allow 
biologists to rapidly obtain needed informa- 
tion from the World Wide Web. However, 
insight cannot be automated, and comput- 
er-based tools that go beyond sophisticated 
retrieval methods may not be the solution. 
One problem is that generalized databases 
are too constraining to allow more than 
minimal documentation of individual pro- 
tein families. Another problem is that the 
literature pertaining to a single family can 
be so vast that only an expert devoted to 
that family can master it. Fortunately, a 
number of biologists interested in particular 
families have begun to exploit the Web to 
provide the kind of rich information that 
can be used to gain insight into function.
At a single family Web site, participation
can be distributed among multiple labora- 
tories, and information can be continually 
updated and integrated (52). Furthermore, 
new Web sites are developed on the basis of 
existing sites. There are currently five Web 
sites dedicated to different nuclear hormone 
receptors spawned from the Nuclear Recep- 
tor Resource, and the Myosin Web site was 
spawned from the Kinesin Web site (53). 
An organized effort to develop such sites is 
in progress (see http://proweb.org for infor- 
mation on participating). 

We have focused here and in the accom- 
panying chart primarily on large and well- 
studied families. But to truly understand a 
biological system, we will need to understand 
the interaction of all individual components. 
Some of these components will not be im- 
mediately classifiable. Eventually, detectable 
homologs for most of these "orphans" will be 
discovered in genome-sequencing projects. 
As a result, new family relationships will 
become delineated that are useful for identi- 
fying critical regions and guiding experimen- 



tal work. This situation is most evident in an 
organism such as M. jannaschii, for which a
large fraction of proteins are as yet unclassi- 
fied orphans, but to a lesser extent it is true 
for all major phyla. The identification and 
classification of new protein families and the 
deep insights that result should continue 
well into the next millennium. 